11 Data Visualization Fundamentals

One picture is worth a thousand words - Fred R. Barnard

Vision is our highest-bandwidth sensory channel: we take in far more information through sight than through all other senses combined, and billions of neurons are dedicated to the task. Moreover, the processing of visual information is, at its first stages, a highly parallel process. Thus, it is generally easier for humans to comprehend information with plots, diagrams and pictures, rather than with text and numbers. This makes data visualizations a vital part of data science. Some of the key purposes of data visualization are:

Data visualization is the first step towards exploratory data analysis (EDA), which reveals trends, patterns, insights, or even irregularities in data.
Data visualization can help explain the workings of complex mathematical models.
Data visualizations are an elegant way to summarise the findings of a data analysis project.
Data visualizations (especially interactive ones such as those on Tableau) may be the end-product of a data analytics project, where the stakeholders make decisions based on the visualizations.

11.1 Learning Objectives

By the end of this chapter, you will be able to:

Understand the fundamental principles of effective data visualization
Classify data types and choose appropriate visualization methods
Create basic plots using Pandas’ built-in plotting capabilities
Customize sophisticated visualizations with Matplotlib’s pyplot interface
Design publication-quality statistical plots using Seaborn
Apply best practices for visual storytelling with data
Compare different visualization libraries and their use cases

11.2 The Art of Visualization: Choosing the Right Plot Type

There are various types of plots available, and selecting the appropriate one is crucial for successful data visualization. The choice primarily depends on two factors:

The type of data you are working with, and
The role of visualization in your data analysis

11.2.1 Data Classification for Visualization

In this chapter, the data we plot usually lives in a pandas DataFrame. Its columns can be classified into two broad categories:

Numeric Data: This type of data represents quantities and can take any value within a range. Common examples include age, height, temperature, etc.
Categorical Data: This type of data represents distinct categories or groups. It can be nominal (no inherent order, like colors or names) or ordinal (with a defined order, like ratings).
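As a rough sketch of how this classification might look in practice, pandas' select_dtypes can split a DataFrame's columns by dtype. The small DataFrame below is invented for illustration, and modeling the ordinal column as an ordered pd.Categorical is one possible choice rather than a requirement.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
people = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "height_cm": [170.2, 165.0, 180.5, 175.3],
    "favorite_color": ["red", "blue", "green", "blue"],   # nominal
    "rating": pd.Categorical(                              # ordinal
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"],
        ordered=True,
    ),
})

# Split the columns by dtype
numeric_cols = people.select_dtypes(include="number").columns.tolist()
categorical_cols = people.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
print("Is 'rating' ordered?", people["rating"].cat.ordered)
```

Keep in mind that the dtype is only a hint: identifiers such as zip codes are often stored as integers even though they behave like categories.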

11.2.2 The Role of Visualization in Data Analysis

Data visualization is essential for effectively communicating insights derived from data analysis. By using various visualization techniques, we can uncover patterns and understand relationships. Below, we discuss different types of data exploration and the relevant visualizations used for each.

11.2.2.1 Univariate Exploration

Purpose: Univariate exploration analyzes a single variable to understand its distribution, central tendency, and spread.

11.2.2.1.1 Common Visualizations:

Histograms: Display the frequency distribution of a numeric variable, helping to identify the shape of the data (e.g., normal, skewed).
Box Plots: Summarize key statistics of a variable, including median, quartiles, and potential outliers.
Bar Plots: Show the count or proportion of categorical variables, revealing the frequency of each category.
Line Plots: Used to display trends in numeric data over time, helping to visualize changes in a variable.

11.2.2.1.2 Insights Gained:

Identify outliers and anomalies.
Understand the range and distribution of values.
Determine central tendency (mean, median, mode).
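The same insights can be cross-checked numerically. The sketch below uses synthetic data (200 typical values plus two deliberately extreme ones) and flags outliers with the common 1.5 × IQR rule, the same convention box plots usually use for their whiskers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
# Synthetic sample: 200 typical values plus two extreme ones
values = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), [95, 110]))

# Range, spread, and central tendency in one call
print(values.describe())

# Flag outliers with the 1.5 * IQR rule
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"Median: {values.median():.2f}, IQR: {iqr:.2f}")
print(f"Values flagged as outliers: {len(outliers)}")
```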

11.2.2.2 Bivariate Analysis

Purpose: Bivariate analysis examines the relationship between two variables, helping to understand how changes in one variable might affect another.

11.2.2.2.1 Common Visualizations:

Scatter Plots: Illustrate the relationship between two numeric variables, highlighting trends and correlations.
Grouped Bar Plots: Compare categorical variables against a numeric variable, revealing trends across categories.
Heatmaps: Represent correlation coefficients between pairs of variables, allowing easy identification of strong correlations.

11.2.2.2.2 Insights Gained:

Assess the strength and direction of relationships (positive, negative, or no correlation).
Identify potential predictive relationships for further analysis.
Discover patterns that may hint at causal relationships (correlation alone does not establish causation).
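To make "strength and direction" concrete, here is a minimal sketch on synthetic data. DataFrame.corr() computes Pearson correlation coefficients by default, and Seaborn's heatmap can display the resulting matrix. The column names are made up for this illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = rng.normal(size=300)
# Synthetic columns with known relationships to x
data = pd.DataFrame({
    "x": x,
    "positively_related": 2 * x + rng.normal(scale=0.5, size=300),
    "negatively_related": -x + rng.normal(scale=0.5, size=300),
    "unrelated": rng.normal(size=300),
})

corr = data.corr()          # Pearson correlation by default
print(corr.round(2))

# Fix the color scale to [-1, 1] so the colors are comparable across plots
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix (synthetic data)")
plt.show()
```

Coefficients near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.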

11.2.2.3 Multivariate Analysis

Purpose: Multivariate analysis investigates more than two variables simultaneously, providing a comprehensive view of complex relationships and interactions.

11.2.2.3.1 Common Visualizations:

Pair Plots: Show pairwise relationships in a dataset, facilitating quick insights into correlations among multiple variables.
3D Scatter Plots: Visualize the interaction between three numeric variables in a three-dimensional space.
Facet Grids: Display multiple plots for different subsets of data, enabling comparisons across categories.

11.2.2.3.2 Insights Gained:

Understand interactions and dependencies among multiple variables.
Identify clusters or groups within the data.
Enhance predictive modeling by considering multiple influences.

11.2.3 Quick Reference: Data Types and Visualization Mapping

Data Type	Examples	Best Visualizations
Single Numeric	Age, Temperature, Price	Histogram, Box Plot, Density Plot
Single Categorical	Gender, Region, Product Type	Bar Chart, Pie Chart, Count Plot
Time Series	Stock prices, Daily sales	Line Plot, Area Chart
Two Numeric	Height vs Weight, Price vs Sales	Scatter Plot, Regression Plot
Numeric + Categorical	Sales by Region, Age by Gender	Grouped Bar, Box Plot, Violin Plot
Two Categorical	Gender vs Product Preference	Stacked Bar, Grouped Bar, Heatmap
Three+ Variables	Multiple dimensions	Pair Plot, Facet Grid, 3D Scatter
Correlations	Feature relationships	Correlation Matrix, Heatmap

Choosing the appropriate plot depends on the data type and the specific analysis purpose. Numeric data typically requires plots that can handle continuous data (like line plots or histograms), while categorical data often benefits from comparisons (like bar plots or pie charts). Always consider what story you want to tell with your data and select your visualization method accordingly.

11.3 Visualization Tools: The Python Ecosystem

Python offers a rich ecosystem of visualization libraries, each designed for specific purposes. In this course, we’ll focus on three fundamental libraries that form the foundation of data visualization in Python.

11.3.1 The Three-Library Approach

1. Pandas - Quick Exploratory Plots

Built-in .plot() method for DataFrames
Perfect for rapid data exploration
Minimal code required

2. Matplotlib - Complete Control

Low-level plotting library
Maximum customization capability
Foundation for other libraries

3. Seaborn - Statistical Beauty

High-level interface built on Matplotlib
Specialized for statistical visualization
Beautiful defaults out of the box

11.3.2 Why These Three?

Feature	Benefit
Complementary	Each library excels at different tasks
Integrated	They work seamlessly together
Industry Standard	Most widely used in data science
Well-Documented	Extensive resources and community support
Progressive Learning	Start simple (Pandas) → Advanced (Matplotlib/Seaborn)

11.3.3 Library Comparison at a Glance

Pandas Plot
├─ Pros: Fastest to write, DataFrame-native
└─ Cons: Limited customization, basic styling

Matplotlib
├─ Pros: Complete control, publication-quality
└─ Cons: Verbose syntax, requires more code

Seaborn
├─ Pros: Beautiful defaults, statistical functions
└─ Cons: Less control than Matplotlib, opinionated

11.3.4 Step 1: Import Libraries

Let’s start by importing all three libraries with their standard aliases:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# We'll also import numpy for numerical operations
import numpy as np

print("✓ Visualization libraries imported and configured successfully!")
print(f"  - Pandas version: {pd.__version__}")
print(f"  - Matplotlib version: {matplotlib.__version__}")
print(f"  - Seaborn version: {sns.__version__}")
print(f"  - NumPy version: {np.__version__}")

✓ Visualization libraries imported and configured successfully!
  - Pandas version: 2.2.2
  - Matplotlib version: 3.8.4
  - Seaborn version: 0.13.2
  - NumPy version: 1.26.4

11.3.5 Step 2: Understanding the Imports

Let’s understand what each import does:

import pandas as pd

Imports the pandas library for data manipulation
Provides the .plot() method on DataFrames
pd is the universally recognized alias

import matplotlib.pyplot as plt

Imports matplotlib’s pyplot module for plotting
pyplot provides a MATLAB-like interface for creating plots
plt is the standard alias used throughout the Matplotlib documentation

import seaborn as sns

Imports seaborn for statistical visualizations
Built on top of matplotlib with enhanced styling
The sns alias comes from the TV show The West Wing: the library is named after the character Samuel Norman Seaborn, whose initials are S.N.S.

import numpy as np

Not a visualization library, but essential for:
Generating sample data for demonstrations
Performing numerical calculations
Creating arrays for plotting

11.3.6 Step 3: Configure Your Environment

Now let’s set up optimal defaults for our visualization environment:

# Configure matplotlib for better display in Jupyter
%matplotlib inline

# Set default figure size for all plots (width, height in inches)
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 5)  # Larger default size
matplotlib.rcParams['figure.dpi'] = 100          # Resolution
matplotlib.rcParams['font.size'] = 11            # Default font size

# Configure Seaborn style
sns.set_style("whitegrid")                       # Clean grid background
sns.set_palette("deep")                          # Seaborn's default palette (use "colorblind" for a colorblind-friendly one)
sns.set_context("notebook")                      # Appropriate scaling

11.3.7 Step 4: Understanding the Configuration

Let’s break down what each configuration does:

11.3.7.1 Matplotlib Configuration

%matplotlib inline (Jupyter Magic Command)

Displays plots directly in the notebook
Without this, plots may not appear, or may open in separate windows
Only needed once per notebook session

figure.figsize - Default Plot Dimensions

Format: (width, height) in inches
Default is (6.4, 4.8) - often too small
We set (8, 5) for better visibility
Can be overridden for individual plots

figure.dpi - Dots Per Inch (Resolution)

Controls image quality and sharpness
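Default: 100 DPI in current Matplotlib releases (some older Jupyter inline backends defaulted to 72)
We set 100 DPI explicitly so that figures render the same way across environments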
Use 300 DPI for publication-quality exports

font.size - Text Size

Default: 10 points
We set 11 for better readability
Affects all text: labels, titles, legends
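Global defaults do not have to be permanent. As a minimal sketch, plt.rc_context() applies settings only inside a with block, and the figsize argument overrides the size for a single figure. The variable names here are illustrative.

```python
import matplotlib.pyplot as plt

default_size = tuple(plt.rcParams["figure.figsize"])

# Temporarily override defaults; they revert when the block exits
with plt.rc_context({"figure.figsize": (4, 3), "font.size": 14}):
    fig_small, ax_small = plt.subplots()
    ax_small.set_title("Created inside rc_context")

# Override the size for one figure only
fig_wide, ax_wide = plt.subplots(figsize=(12, 3))

print(fig_small.get_size_inches())   # size taken from rc_context
print(fig_wide.get_size_inches())    # size taken from the figsize argument
print(tuple(plt.rcParams["figure.figsize"]) == default_size)  # global default unchanged
```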

11.3.7.2 Seaborn Configuration

set_style("whitegrid") - Visual Theme

Options: darkgrid, whitegrid, dark, white, ticks
whitegrid: Clean white background with subtle grid
Professional appearance suitable for presentations

set_palette("deep") - Color Scheme

Options: deep, muted, pastel, colorblind, etc.
deep: Saturated colors with good contrast
Automatically applied to Seaborn plots

set_context("notebook") - Scaling Preset

Options: paper, notebook, talk, poster
Controls relative size of elements
notebook: Optimal for Jupyter display
Use talk for presentations, poster for large displays
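Seaborn settings can also be inspected or applied temporarily. The sketch below looks at the "colorblind" palette via sns.color_palette() and uses sns.axes_style() as a context manager to style a single figure. It is only meant to illustrate the options listed above.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Named palettes are returned as a list of RGB tuples
colorblind = sns.color_palette("colorblind")
print(len(colorblind), "colors; first:", colorblind[0])

# Temporarily switch the style for one figure
with sns.axes_style("ticks"):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title("Drawn with the 'ticks' style")
plt.show()
```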

11.3.8 Step 5: Verify Your Setup

Let’s create a quick test to confirm everything is working correctly:

# Quick test plot to verify configuration
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Test 1: Pandas plot
test_data = pd.DataFrame({
    'x': range(10),
    'y': np.random.randn(10).cumsum()
})
test_data.plot(x='x', y='y', ax=axes[0], title='Pandas Plot', legend=False)
axes[0].set_ylabel('Value')

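# Test 2: Matplotlib plot
x_vals = np.linspace(0, 10, 50)
axes[1].plot(x_vals, np.sin(x_vals))  # draw a simple curve so the panel is not empty
axes[1].set_title('Matplotlib Plot')
axes[1].set_xlabel('X axis')
axes[1].set_ylabel('Y axis')
axes[1].grid(True, alpha=0.3)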

# Test 3: Seaborn plot
test_df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D', 'E'],
    'values': np.random.randint(10, 50, 5)
})
sns.barplot(data=test_df, x='category', y='values', ax=axes[2])
axes[2].set_title('Seaborn Plot')

plt.tight_layout()
plt.suptitle('All Three Libraries Working!', fontsize=14, y=1.02)
plt.show()

print("\n✓ All libraries are working correctly!")
print("  You're ready to start creating visualizations!")


✓ All libraries are working correctly!
  You're ready to start creating visualizations!

11.3.9 Step 6: Pro Tips for Working with These Libraries

Now that everything is set up, here are some professional tips for effective visualization:

11.3.9.1 Layer Your Learning

Start simple and progressively add complexity:

Begin: Use Pandas .plot() for quick data exploration
Enhance: Add Matplotlib customization for specific needs
Refine: Use Seaborn for statistical visualizations
Combine: Mix all three in complex projects

11.3.9.2 Common Workflow Pattern

Here’s a typical pattern that combines all three libraries:

# Step 1: Create base plot with Seaborn (easy syntax + beautiful defaults)
sns.scatterplot(data=df, x='var1', y='var2', hue='category')

# Step 2: Customize with Matplotlib (fine-grained control)
plt.title('My Custom Title', fontsize=14, fontweight='bold')
plt.xlabel('Custom X Label')
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)

# Step 3: Display
plt.show()
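The pattern above uses placeholder names (df, var1, var2, category). A self-contained version might look like the following sketch, where the DataFrame is filled with random numbers purely so the code can run:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
# Hypothetical stand-in for `df`, reusing the placeholder column names
df = pd.DataFrame({
    "var1": rng.normal(size=90),
    "var2": rng.normal(size=90),
    "category": np.repeat(["A", "B", "C"], 30),
})

# Step 1: base plot with Seaborn (returns the Matplotlib Axes it drew on)
ax = sns.scatterplot(data=df, x="var1", y="var2", hue="category")

# Step 2: customize with Matplotlib (pyplot calls act on the current axes)
plt.title("My Custom Title", fontsize=14, fontweight="bold")
plt.xlabel("Custom X Label")
plt.axhline(y=0, color="red", linestyle="--", alpha=0.5)

# Step 3: display
plt.show()
```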

11.3.9.3 Troubleshooting Quick Reference

Problem	Solution
Plots not showing	Add `plt.show()` or verify `%matplotlib inline` is set
Plots overlapping	Use `plt.figure()` before each new plot or `plt.clf()` to clear
Text too small	Increase `font.size` in rcParams or use `fontsize` parameter
Poor image quality	Save with higher DPI: `plt.savefig('plot.png', dpi=300)`
Colors hard to distinguish	Use colorblind-friendly palette: `sns.set_palette("colorblind")`
Legend outside plot area	Adjust with `bbox_inches='tight'` when saving

11.3.9.4 Performance Tips for Large Datasets

When working with datasets >100K points:

Sampling: Plot a representative random subset
Aggregation: Group/bin data before plotting
Rasterization: Use rasterized=True in scatter plots
Alternative libraries: Consider Plotly or Bokeh for interactive visualizations
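As an illustration of the first three tips, the sketch below generates 200,000 synthetic points, plots a random sample as a rasterized scatter plot, and aggregates the full dataset with Matplotlib's hexbin. The sample size and grid size are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
n = 200_000
big = pd.DataFrame({"x": rng.normal(size=n)})
big["y"] = 0.5 * big["x"] + rng.normal(size=n)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Sampling: plot a random subset (rasterized=True mainly helps vector exports such as PDF)
sample = big.sample(n=5_000, random_state=0)
axes[0].scatter(sample["x"], sample["y"], s=4, alpha=0.4, rasterized=True)
axes[0].set_title("Random sample of 5,000 points")

# Aggregation: bin all points into hexagons and color by count
hb = axes[1].hexbin(big["x"], big["y"], gridsize=40, cmap="viridis")
fig.colorbar(hb, ax=axes[1], label="Points per bin")
axes[1].set_title("Hexbin of all 200,000 points")

plt.tight_layout()
plt.show()
```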

11.3.9.5 Saving High-Quality Plots

Professional way to save your visualizations:

plt.savefig('my_plot.png', 
            dpi=300,                    # Publication quality (300 DPI)
            bbox_inches='tight',        # Remove extra whitespace
            facecolor='white',          # Ensure white background
            edgecolor='none',           # No border
            transparent=False)          # Solid background

11.3.10 You’re Ready!

Your visualization environment is now fully configured and tested. In the following sections, we’ll explore each library in detail, starting with Pandas for quick exploratory analysis.

11.4 Basic Plotting with Pandas

In previous chapters, we focused on using pandas for data reading and analysis. In addition to its powerful data manipulation capabilities, pandas also provides built-in plotting tools that make it especially valuable for:

Rapid Exploration - Create plots with minimal code
Seamless Integration - No data conversion needed
Quick Insights - Perfect for initial data investigation
Iterative Analysis - Easy to modify and regenerate

The Power of .plot(): Pandas’ .plot() method is your Swiss Army knife for quick visualizations. It’s built on Matplotlib but provides a simpler, DataFrame-centric interface.

11.4.1 📂 Dataset: COVID-19 Cases

In this section, we’ll use real COVID-19 data to demonstrate practical plotting techniques. This dataset contains time series information about new cases and deaths, making it ideal for learning data visualization.

covid_df = pd.read_csv('./Datasets/covid.csv')
covid_df.head(5)

	date	new_tests	total_tests	tests_per_million
0	2019-12-31	NaN	NaN	NaN
1	2020-01-01	NaN	NaN	NaN
2	2020-01-02	NaN	NaN	NaN
3	2020-01-03	NaN	NaN	NaN
4	2020-01-04	NaN	NaN	NaN

11.4.2 Step 1: Your First Pandas Plot

Let’s start with the simplest possible plot - visualizing a single variable.

Concept: Line Plots for Time Series

Line plots are ideal for showing how a value changes along a continuous variable such as time, for example the progression of new cases over a series of dates.

The Simplest Plot: With pandas, creating a plot is as easy as calling .plot() on a Series or DataFrame:

covid_df.new_cases.plot()

Problem Identified: 🤔

While this plot shows the overall trend, it’s hard to tell when the peak occurred - the X-axis shows indices (0, 1, 2…) instead of dates!

Solution: Set the Date as Index

Since this is a time series dataset, we can use the date column as the DataFrame index. This tells pandas to use actual dates on the X-axis:

covid_df.set_index('date', inplace=True)

covid_df.new_cases.plot(rot=45);

Much Better!

Now we can clearly see that the peak occurred around March 2020. The rot=45 parameter rotates the X-axis labels 45 degrees for better readability.

Key Insight: When working with time series data, always set the date/time column as the index for automatic, proper time-axis formatting.
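One detail worth checking is the type of the index. If the date column was read as plain strings, converting it with pd.to_datetime() gives a DatetimeIndex, which enables date-aware tick formatting and date slicing. The sketch below uses a few made-up rows; only the column names mirror the dataset.

```python
import pandas as pd

# Made-up values; only the column names mirror the chapter's dataset
demo = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"],
    "new_cases": [240, 561, 347, 466],
})

demo["date"] = pd.to_datetime(demo["date"])   # strings -> datetime64
demo = demo.set_index("date")
print(type(demo.index).__name__)

ax = demo["new_cases"].plot(rot=45, title="New cases (demo data)")

# A DatetimeIndex also allows slicing by date labels
subset = demo.loc["2020-03-02":"2020-03-03"]
print(subset)
```

When loading from a file, passing parse_dates=['date'] and index_col='date' to pd.read_csv() achieves the same result in one step.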

11.4.3 Step 2: Plotting Multiple Variables

Now that we have one variable plotted, let’s compare multiple variables on the same plot to see relationships.

Goal: Compare new cases and new deaths trends over time

How: Simply pass a list of column names to the DataFrame:

covid_df[['new_deaths', 'new_cases']].plot();

What Happened?

Pandas automatically:

Created two lines (one per column)
Assigned different colors to each line
Generated a legend with column names
Used the index (dates) for the X-axis
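When two columns live on very different scales, the smaller one can look almost flat. One option is the secondary_y parameter of .plot(), which draws the listed columns against a second Y-axis on the right. The sketch below uses made-up numbers, not the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
dates = pd.date_range("2020-03-01", periods=60, freq="D")
# Made-up series on very different scales
demo = pd.DataFrame({
    "new_cases": rng.integers(1000, 5000, size=60),
    "new_deaths": rng.integers(10, 100, size=60),
}, index=dates)

# Columns listed in secondary_y are drawn against the right-hand axis
ax = demo.plot(secondary_y=["new_deaths"], figsize=(10, 4))
ax.set_ylabel("New cases")
ax.right_ax.set_ylabel("New deaths")
```

Dual axes can be misleading if readers compare the two lines directly, so label both axes clearly.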

11.4.4 Step 3: Customizing Your Plots

By default, pandas generates line plots using the .plot() method. However, you can adjust several parameters to enhance the appearance:

Common Customization Parameters:

Parameter	Purpose	Example
`figsize`	Set figure dimensions (width, height) in inches	`figsize=(12, 6)`
`linewidth` or `lw`	Control line thickness	`linewidth=2`
`marker`	Add data point markers	`marker='o'`
`color`	Set line color(s)	`color='red'` or `color=['red', 'blue']`
`alpha`	Set transparency (0=transparent, 1=opaque)	`alpha=0.7`
`rot`	Rotate X-axis labels	`rot=45`
`grid`	Show/hide gridlines	`grid=True`
`title`	Add plot title	`title='My Plot'`

Let’s enhance our plot:

covid_df[['new_deaths', 'new_cases']].plot(figsize=(12, 6), linewidth=2, marker='o');

Result: The plot is now larger, has thicker lines, and shows markers at each data point - much easier to read!

11.4.5 Step 4: Different Plot Types

So far, we’ve only created line plots (the default). However, pandas accepts 11 values for the kind parameter ('kde' and 'density' are two names for the same plot, so the table below has 10 rows):

`kind` Value	Plot Type	Best For
`'line'`	Line plot (default)	Trends over time, continuous data
`'scatter'`	Scatter plot	Relationships between two variables
`'bar'`	Vertical bar chart	Comparing categories
`'barh'`	Horizontal bar chart	Comparing categories (long labels)
`'hist'`	Histogram	Distribution of a single variable
`'box'`	Box plot	Distribution summary, outlier detection
`'kde'` or `'density'`	Kernel Density Estimate	Smooth distribution visualization
`'area'`	Area plot	Cumulative trends
`'pie'`	Pie chart	Part-to-whole relationships
`'hexbin'`	Hexagonal bin plot	Dense scatter plot patterns

Let’s explore the most common plot types:

11.4.5.1 Scatter Plot: Finding Relationships

Question: Is there a correlation between new cases and new deaths?

Best Tool: Scatter plot - shows the relationship between two numerical variables

Let’s next create a scatter plot to visualize the relationship between new cases and new deaths, and explore whether there’s a correlation between them.

covid_df.plot(kind='scatter', x='new_cases', y='new_deaths', color='r', alpha=0.5);

Observation: There appears to be a positive correlation - as new cases increase, new deaths tend to increase as well. The alpha=0.5 parameter makes points semi-transparent, helping us see overlapping data points.

11.4.5.2 Histogram: Understanding Distribution

Question: What’s the typical range of daily deaths? Are there outliers?

Best Tool: Histogram - shows the frequency distribution of a single variable

covid_df.new_deaths.plot(kind='hist', color='r', alpha=0.5, bins=50);

Observation: Most days have relatively few deaths, with occasional days having much higher counts. This long tail toward large values makes the distribution right-skewed (positively skewed). The bins=50 parameter creates 50 bins for finer detail.
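A numeric check can complement the visual impression of a histogram. The sketch below uses synthetic data (not the COVID dataset) with many small values and a long tail of large ones. Series.skew() returns a positive value when the tail extends toward large values, and the mean being larger than the median points the same way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
# Synthetic counts: mostly small values with a long tail of large ones
daily_counts = pd.Series(rng.exponential(scale=20, size=1000))

print(f"Mean:   {daily_counts.mean():.1f}")
print(f"Median: {daily_counts.median():.1f}")
print(f"Skew:   {daily_counts.skew():.2f}")   # positive -> tail toward large values

ax = daily_counts.plot(kind="hist", bins=50, alpha=0.5)
```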

11.4.5.3 Box Plot: Spotting Outliers

Question: What are the median, quartiles, and outliers for new cases?

Best Tool: Box plot - summarizes distribution and highlights outliers

covid_df.new_cases.plot(kind='box');

Observation: The box spans the 25th to 75th percentiles, the line inside it marks the median, and the circles are outliers. We can see several days with exceptionally high case counts.

11.4.6 Step 5: Choosing the Right Plot Type

Not sure which plot to use? Follow this decision guide:

📈 WANT TO SHOW TRENDS OVER TIME?
   └─> Use: kind='line' (default)
   └─> Example: Stock prices, temperature changes

🔍 WANT TO SHOW RELATIONSHIPS BETWEEN TWO VARIABLES?
   └─> Use: kind='scatter'
   └─> Example: Height vs. weight, study time vs. grades

📊 WANT TO SHOW DISTRIBUTION OF ONE VARIABLE?
   └─> Use: kind='hist' or kind='box'
   └─> Histogram: See the shape of distribution
   └─> Box plot: See median, quartiles, and outliers

📊 WANT TO COMPARE CATEGORIES?
   └─> Use: kind='bar' or kind='barh'
   └─> Example: Sales by region, counts by category

🥧 WANT TO SHOW PARTS OF A WHOLE?
   └─> Use: kind='pie'
   └─> Warning: Use sparingly - often harder to read than bar charts!
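The examples so far covered line, scatter, histogram, and box plots. For comparing categories, a common approach is to count the values first and then plot the counts. The sketch below uses a small hypothetical region column.

```python
import pandas as pd

# Hypothetical categorical column
regions = pd.Series(
    ["North", "South", "North", "East", "South", "North", "West", "East"],
    name="region",
)

counts = regions.value_counts()     # sorted from most to least frequent
print(counts)

# Horizontal bars leave more room for long category labels
ax = counts.plot(kind="barh", title="Records per region")
ax.set_xlabel("Count")
```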

11.4.7 Additional Resources

For more plot types and detailed information, refer to the official pandas documentation:

Series.plot - Plotting methods for Series
DataFrame.plot - Plotting methods for DataFrames

11.4.8 Summary: Pandas Plotting Strengths & Limitations

11.4.8.1 ✅ Strengths (When to Use Pandas):

Speed - Create plots in 1-2 lines of code
Integration - Works directly with DataFrames (no conversion)
Exploration - Perfect for quick data checks
Learning Curve - Easiest to learn for beginners
Iteration - Rapidly test different visualizations

Best Use Cases:

Rapid data exploration during analysis
Quick sanity checks during data cleaning
Initial pattern recognition
Jupyter notebook investigations

11.4.8.2 ⚠️ Limitations (When to Move Beyond Pandas):

Limited Aesthetics - Basic styling, not publication-ready
Simple Layouts - Hard to create multi-panel figures
Customization - Fine-grained control requires Matplotlib
Statistical Plots - Only basic ones (histogram, box plot, KDE) are available through .plot()
Plot Variety - Missing advanced plot types

When to Switch Libraries:

→ Matplotlib: Need fine-grained control, custom layouts, publication quality
→ Seaborn: Need statistical plots, beautiful defaults, correlation matrices
→ Plotly: Need interactive plots, dashboards, 3D visualizations

11.4.9 Practice Exercise: Apply Your Skills

Now it’s your turn! Try creating these plots with the COVID dataset:

Exercise 1: Basic Line Plot

# Plot new_cases with:
# - Figure size of (14, 6)
# - Green color
# - Title "COVID-19 New Cases Over Time"

Exercise 2: Comparison Plot

# Plot new_cases and new_deaths together with:
# - Different line styles (solid and dashed)
# - Add a grid

Exercise 3: Distribution Analysis

# Create a histogram of new_cases with:
# - 30 bins
# - Blue color with 50% transparency

Exercise 4: Relationship Analysis

# Create a scatter plot showing new_cases vs new_deaths
# Add appropriate axis labels

💡 Tip: Refer to the DataFrame.plot documentation to find the parameters you need!

11.5 Data Plotting with Matplotlib pyplot Interface

You’ve just learned to create plots with pandas’ .plot() method. But here’s an important insight: Pandas plotting is built on top of Matplotlib!

When you call df.plot(), pandas is actually calling Matplotlib functions behind the scenes. Think of it like this:

┌─────────────────────────────────────┐
│     Pandas .plot()                  │  ← High-level, easy, limited control
├─────────────────────────────────────┤
│     Matplotlib (pyplot)             │  ← Low-level, powerful, full control
└─────────────────────────────────────┘

Why Learn Matplotlib Directly?

While Pandas is great for quick plots, Matplotlib gives you:

Maximum Control - Customize every element of your plots
Complex Layouts - Create multi-panel figures and subplots
Publication Quality - Professional scientific visualizations
Foundation Knowledge - Understanding how visualization really works in Python
Flexibility - Work with any data structure (lists, arrays, DataFrames)

11.5.1 What is Matplotlib?

Matplotlib is:

A comprehensive library for creating static, animated, and interactive visualizations in Python
Designed to emulate MATLAB’s plotting interface (hence the name)
The foundation for many other visualization libraries (Pandas, Seaborn, etc.)
Compatible with Python scripts, IPython shells, Jupyter notebooks, and web servers
Mostly written in Python, with some segments in C, Objective-C, and JavaScript for platform compatibility

11.5.2 Step 1: Understanding pyplot

What is pyplot?

pyplot is a module within Matplotlib that provides a state-based interface for creating plots. It’s the most commonly used part of Matplotlib.

Key Concepts:

Matplotlib = The entire plotting library/package
pyplot = A specific module within Matplotlib that makes plotting easier
plt = The conventional alias used when importing pyplot

The State-Based Interface:

pyplot maintains an internal current figure and axes. When you call functions like plt.plot(), plt.xlabel(), etc., they automatically apply to the current figure. This makes it easy to build plots step by step without explicitly managing figure objects.
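The "current figure and axes" can be made visible with plt.gcf() and plt.gca(). This short sketch shows that pyplot calls act on whichever axes is current, and that calling plt.figure() makes a new figure current.

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 8])   # implicitly creates a figure and axes
plt.xlabel("x")                  # applies to the current axes

fig = plt.gcf()   # "get current figure"
ax = plt.gca()    # "get current axes"
print(ax.get_xlabel())           # the label set through plt.xlabel
print(len(ax.lines))             # number of lines drawn on these axes

plt.figure()                     # a new, empty figure becomes current
print(plt.gcf() is fig)          # False: later pyplot calls now target the new figure
```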

Import Convention:

The standard way to import pyplot is:

import matplotlib.pyplot as plt

Data Sources:

pyplot can work with various data types:

Python lists
NumPy arrays
Pandas Series and DataFrames

Note: Internally, all sequences are converted to NumPy arrays for processing.
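As a small check of this behavior, the sketch below plots the same values from a list, a NumPy array, and a pandas Series, then inspects the data stored on each resulting line. Passing orig=False to get_ydata() asks for the data as Matplotlib holds it for drawing.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = [0.895, 0.91, 0.919]

# plt.plot returns a list of Line2D objects; unpack the single line from each call
(line_from_list,) = plt.plot(values)
(line_from_array,) = plt.plot(np.array(values) + 0.01)
(line_from_series,) = plt.plot(pd.Series(values) + 0.02)

for line in (line_from_list, line_from_array, line_from_series):
    # The y-data as stored for drawing
    print(type(line.get_ydata(orig=False)).__name__)
```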

11.5.3 Step 2: Your First pyplot Plot

Let’s start with the simplest possible plot using Python lists to illustrate basic plotting with Matplotlib pyplot:

yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

Create the Plot:

Now let’s visualize this data with a simple line plot:

plt.plot(yield_apples);

What Just Happened?

Calling plt.plot() creates a line chart showing the trend in apple yields.

The Semicolon Trick:

You might notice the semicolon (;) at the end of the command. plt.plot() returns a list of line objects, and without the semicolon Jupyter prints its text representation, such as [<matplotlib.lines.Line2D at 0x2194b571df0>], before displaying the plot. The semicolon suppresses this output, showing only the graph.

With semicolon (cleaner output):

plt.plot(yield_apples);

Problem: The X-axis shows list indices (0, 1, 2, 3, 4, 5) instead of meaningful values like years.

Solution: Provide both X and Y data to plt.plot().

11.5.4 Step 3: Customizing Axes

Let’s make our plot more informative by specifying custom X-axis values. We’ll use years instead of indices:

years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

plt.plot(years, yield_apples);

Much Better! Now the X-axis shows actual years, making the plot meaningful.

Syntax: plt.plot(x_values, y_values)

11.5.5 Step 4: Plotting Multiple Lines

You can create multiple lines on the same plot by calling plt.plot() multiple times. Each call adds another line to the current figure.

Use Case: Let’s compare the yields of apples vs. oranges over time.

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

plt.plot(years, apples)
plt.plot(years, oranges);

Result: Matplotlib automatically assigns different colors to each line and displays both on the same axes.

Default Settings:

When plt.plot() is called without formatting parameters, pyplot uses these defaults:

Property	Default Value
Figure size	6.4 × 4.8 inches
Line style	Solid line
Line width	1.5
First line color	Blue (#1f77b4)
Subsequent colors	Automatic color cycle

Customizing Global Defaults:

You can change default settings for all future plots using matplotlib.rcParams:

import matplotlib
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (7, 4)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Note: While we’re focusing on the pyplot interface in this section, Matplotlib also offers an object-oriented interface that provides even more control. We’ll explore that in the next chapter.

For more on customizing defaults, see: Customizing Matplotlib with rcParams
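Several of the defaults listed in the table above can be read from matplotlib.rcParamsDefault, and plt.rcdefaults() restores them after experimenting. The following is a minimal sketch of that idea:

```python
import matplotlib
import matplotlib.pyplot as plt

defaults = matplotlib.rcParamsDefault   # Matplotlib's built-in defaults
print(defaults["figure.figsize"])       # default figure size in inches
print(defaults["lines.linewidth"])      # default line width
print(defaults["axes.prop_cycle"].by_key()["color"][0])  # first color in the cycle

plt.rcParams["font.size"] = 14          # change a setting...
plt.rcdefaults()                        # ...then restore the built-in defaults
print(plt.rcParams["font.size"] == defaults["font.size"])
```

Note that plt.rcdefaults() resets every customization made in the session, including any applied at the top of a notebook.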

11.5.6 Step 5: Adding Labels, Titles, and Legends

A good plot needs clear labels so readers understand what they’re looking at. Let’s enhance our plot with descriptive text elements.

Every Complete Plot Should Have:

Title - What the plot shows
Axis Labels - What each axis represents (with units!)
Legend - Which line is which (when plotting multiple series)

11.5.6.1 Adding Axis Labels

Let’s start by adding labels to our axes using plt.xlabel() and plt.ylabel():

plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');

Good! But which line is which? Let’s add a title and legend.

11.5.6.2 Adding Title and Legend

Use plt.title() for the title and plt.legend() to identify each line:

plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Perfect! Now our plot is fully labeled and easy to understand.
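Passing a list to plt.legend() relies on the names being in the same order as the plt.plot() calls. An alternative is to attach a label to each line, which keeps names and data together. The sketch below uses the first six values of each series.

```python
import matplotlib.pyplot as plt

years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# Each line carries its own label, so the plotting order no longer matters
plt.plot(years, oranges, label="Oranges")
plt.plot(years, apples, label="Apples")

plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")
plt.title("Crop Yields in Kanto")
legend = plt.legend()   # picks up the labels automatically
```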

11.5.7 Step 6: Styling Lines and Markers

Matplotlib offers extensive customization for how lines and markers appear. Let’s explore the options.

11.5.7.1 Adding Markers

Show data points explicitly using the marker parameter. Matplotlib provides many marker styles:

'o' = circle
'x' = X mark
's' = square
'^' = triangle up
'*' = star
'+' = plus sign

See the full list of markers.

plt.plot(years, apples, marker='o')
plt.plot(years, oranges, marker='x')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Great! The markers make it easier to see individual data points.

11.5.7.2 Complete Styling Options

The plt.plot() function supports many styling arguments:

Parameter	Alias	Purpose	Example
`color`	`c`	Set line color	`color='red'` or `c='#FF0000'`
`linestyle`	`ls`	Line style (solid, dashed, etc.)	`linestyle='--'` or `ls=':'`
`linewidth`	`lw`	Line thickness	`linewidth=2` or `lw=3`
`marker`	-	Marker style	`marker='o'`
`markersize`	`ms`	Marker size	`markersize=10` or `ms=8`
`markeredgecolor`	`mec`	Marker edge color	`mec='navy'`
`markeredgewidth`	`mew`	Marker edge thickness	`mew=2`
`markerfacecolor`	`mfc`	Marker fill color	`mfc='lightblue'`
`alpha`	-	Transparency (0=invisible, 1=opaque)	`alpha=0.7`

Documentation: plt.plot() reference

Example with Multiple Styling Parameters:

plt.plot(years, apples, marker='s', c='b', ls='-', lw=2, ms=8, mew=2, mec='navy')
plt.plot(years, oranges, marker='o', c='r', ls='--', lw=3, ms=10, alpha=.5)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Impressive! We’ve created a highly customized plot with thick lines, custom colors, different markers, and transparency.

11.5.7.3 The fmt Shorthand

For quick styling, Matplotlib provides the fmt string format as a shorthand:

Syntax: fmt = '[marker][line][color]'

Common Examples:

'o-r' = circle markers, solid line, red color
'x--b' = X markers, dashed line, blue color
'^:g' = triangle markers, dotted line, green color
'sb' = square markers, no line, blue color

Color Codes:

'b' = blue, 'g' = green, 'r' = red, 'c' = cyan
'm' = magenta, 'y' = yellow, 'k' = black, 'w' = white

Line Styles:

'-' = solid, '--' = dashed, ':' = dotted, '-.' = dash-dot

Example:

plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Much simpler code! The fmt string 's-b' means square markers, solid line, blue and 'o--r' means circle markers, dashed line, red.
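To check that a fmt string is only shorthand for the marker, linestyle and color keyword arguments, the sketch below plots the same data both ways and inspects the resulting line objects. The years and yields values are made up for illustration and are not the chapter's data.

```python
import matplotlib.pyplot as plt

# Illustrative values only (not the chapter's dataset)
years = [2010, 2011, 2012, 2013]
yields = [0.90, 0.91, 0.93, 0.97]

fig, ax = plt.subplots()

# fmt shorthand: square markers, solid line, blue
(line_fmt,) = ax.plot(years, yields, 's-b')

# The same styling spelled out with keyword arguments
(line_kw,) = ax.plot(years, yields, marker='s', linestyle='-', color='b')

# No line style in the fmt string: markers only
(line_markers,) = ax.plot(years, yields, 'or')

# Each Line2D object reports the style it ended up with
for line in (line_fmt, line_kw, line_markers):
    print(line.get_marker(), line.get_linestyle(), line.get_color())
```

The first two lines should report identical styles, while the third reports no line style at all, which is why only its markers are drawn.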

11.5.7.4 Markers Without Lines

If you don’t specify a line style in fmt, only markers are drawn (no connecting lines):

plt.plot(years, apples, 'sb')
plt.plot(years, oranges, 'or')
plt.title("Yield (tons per hectare)");

Result: Only markers, no lines. Perfect for scatter-like visualizations!

11.5.8 Step 7: Controlling Figure Size

By default, Matplotlib creates figures that are 6.4 × 4.8 inches. You can change this using plt.figure(figsize=(width, height)):

plt.figure(figsize=(6, 4))

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");

Important: Call plt.figure() BEFORE plt.plot() to set the size for that specific figure.

11.5.9 Step 8: Other Plot Types with pyplot

So far we’ve focused on line plots, but pyplot supports many other visualization types. Let’s explore them using a different dataset.

New Dataset: FIFA Player Data

Let’s load FIFA player statistics to demonstrate various plot types:

fifa = pd.read_csv('./Datasets/fifa_data.csv')
fifa.head(5)

	Unnamed: 0	ID	Name	Age	Photo	Nationality	Flag	Overall	Potential	Club	...	Composure	Marking	StandingTackle	SlidingTackle	GKDiving	GKHandling	GKKicking	GKPositioning	GKReflexes	Release Clause
0	0	158023	L. Messi	31	https://cdn.sofifa.org/players/4/19/158023.png	Argentina	https://cdn.sofifa.org/flags/52.png	94	94	FC Barcelona	...	96.0	33.0	28.0	26.0	6.0	11.0	15.0	14.0	8.0	€226.5M
1	1	20801	Cristiano Ronaldo	33	https://cdn.sofifa.org/players/4/19/20801.png	Portugal	https://cdn.sofifa.org/flags/38.png	94	94	Juventus	...	95.0	28.0	31.0	23.0	7.0	11.0	15.0	14.0	11.0	€127.1M
2	2	190871	Neymar Jr	26	https://cdn.sofifa.org/players/4/19/190871.png	Brazil	https://cdn.sofifa.org/flags/54.png	92	93	Paris Saint-Germain	...	94.0	27.0	24.0	33.0	9.0	9.0	15.0	15.0	11.0	€228.1M
3	3	193080	De Gea	27	https://cdn.sofifa.org/players/4/19/193080.png	Spain	https://cdn.sofifa.org/flags/45.png	91	93	Manchester United	...	68.0	15.0	21.0	13.0	90.0	85.0	87.0	88.0	94.0	€138.6M
4	4	192985	K. De Bruyne	27	https://cdn.sofifa.org/players/4/19/192985.png	Belgium	https://cdn.sofifa.org/flags/7.png	91	92	Manchester City	...	88.0	68.0	58.0	51.0	15.0	13.0	5.0	10.0	13.0	€196.4M

5 rows × 89 columns

11.5.9.1 Histogram: Distribution of Values

Use Case: Understanding the distribution of player skill levels

Function: plt.hist(data, bins=number, color='color')

plt.figure(figsize=(8,5))

plt.hist(fifa.Overall, color='#abcdef')

plt.ylabel('Number of Players')
plt.xlabel('Skill Level')
plt.title('Distribution of Player Skills in FIFA 2018');

Observation: Player skills follow a roughly normal distribution, with most players having moderate skills (60-70 range).
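The shape of a histogram depends on how the values are binned, so it helps to look at what plt.hist() returns. The sketch below uses synthetic ratings as a stand-in for fifa.Overall (an assumption, so that the example runs without the CSV file) and sets the bins argument explicitly instead of relying on the default of 10.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for fifa.Overall (assumption: roughly bell-shaped ratings)
rng = np.random.default_rng(0)
ratings = rng.normal(loc=66, scale=7, size=1000)

fig, ax = plt.subplots(figsize=(8, 5))

# hist() returns the count per bin, the bin edges, and the drawn rectangles
counts, edges, patches = ax.hist(ratings, bins=20, color='#abcdef', edgecolor='white')

ax.set_xlabel('Skill Level')
ax.set_ylabel('Number of Players')

print(len(counts), len(edges))   # n bins are delimited by n + 1 edges
print(counts.sum())              # every observation lands in exactly one bin
```

Trying a few values of bins is a cheap way to confirm that a pattern is a property of the data and not of the binning.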

11.5.9.2 Bar Chart: Comparing Categories

Use Case: Comparing counts between categories

Function: plt.bar(categories, values, color='color')

# Bar chart of preferred-foot counts
plt.figure(figsize=(8,5))

foot_preference = fifa['Preferred Foot'].value_counts()

# Index by label so each bar gets the right count regardless of sort order
plt.bar(['Left', 'Right'], [foot_preference['Left'], foot_preference['Right']], color='#abcdef')

plt.ylabel('Number of Players')
plt.title('Foot Preference of FIFA Players');

Observation: More FIFA players prefer their right foot over their left.
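Because value_counts() returns a Series whose index holds the category names, its index and values can be passed straight to a bar chart. The sketch below uses a tiny hand-made Series in place of the fifa['Preferred Foot'] column, which is an assumption made only to keep the example self-contained.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hand-made stand-in for fifa['Preferred Foot']
feet = pd.Series(['Right', 'Left', 'Right', 'Right', 'Left', 'Right'],
                 name='Preferred Foot')

# value_counts() sorts by frequency, most common category first
counts = feet.value_counts()

fig, ax = plt.subplots()
bars = ax.bar(counts.index.tolist(), counts.values, color='#abcdef')

ax.set_ylabel('Number of Players')
ax.set_title('Foot Preference (illustrative data)')

heights = [bar.get_height() for bar in bars]
print(counts.index.tolist(), heights)
```

Labels and heights come from the same Series here, so they cannot get out of step when the category order changes.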

11.5.9.3 Pie Chart: Parts of a Whole

Use Case: Showing proportions/percentages

Function: plt.pie(values, labels=labels, autopct='format')

# Count players by preferred foot (summing a boolean mask counts the True values)
left = (fifa['Preferred Foot'] == 'Left').sum()
right = (fifa['Preferred Foot'] == 'Right').sum()

plt.figure(figsize=(8,5))

labels = ['Left', 'Right']
colors = ['#abcdef', '#aabbcc']

plt.pie([left, right], labels = labels, colors=colors, autopct='%.2f %%')

plt.title('Foot Preference of FIFA Players');

Tip: The autopct='%.2f %%' parameter automatically calculates and displays percentages.
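The percentage shown on each slice is simply that slice's value divided by the total. The sketch below, using two made-up counts, compares the labels that plt.pie() generates with the same calculation done by hand.

```python
import matplotlib.pyplot as plt

values = [1, 3]   # made-up counts, chosen so the shares are easy to check

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(values, labels=['Left', 'Right'],
                                  autopct='%.2f %%')

# What autopct computes for each slice: 100 * value / total, then formatted
manual = ['%.2f %%' % (100 * v / sum(values)) for v in values]
shown = [t.get_text() for t in autotexts]

print(shown)
print(manual)
```

In the format string, %.2f prints two decimal places and %% prints a literal percent sign.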

11.5.9.4 Advanced Pie Chart with Explode

You can explode (pull out) specific slices to emphasize them:

plt.figure(figsize=(8,5), dpi=100)

plt.style.use('ggplot')

fifa.Weight = [int(x.strip('lbs')) if type(x)==str else x for x in fifa.Weight]

light = fifa.loc[fifa.Weight < 125].count().iloc[0]
light_medium = fifa[(fifa.Weight >= 125) & (fifa.Weight < 150)].count().iloc[0]
medium = fifa[(fifa.Weight >= 150) & (fifa.Weight < 175)].count().iloc[0]
medium_heavy = fifa[(fifa.Weight >= 175) & (fifa.Weight < 200)].count().iloc[0]
heavy = fifa[fifa.Weight >= 200].count().iloc[0]

weights = [light,light_medium, medium, medium_heavy, heavy]
label = ['under 125', '125-150', '150-175', '175-200', 'over 200']
explode = (.4,.2,0,0,.4)

plt.title('Weight of Professional Soccer Players (lbs)')

plt.pie(weights, labels=label, explode=explode, pctdistance=0.8,autopct='%.2f %%');

Advanced Features:

explode parameter pulls out slices (0 = normal, 0.4 = pulled out significantly)
plt.style.use('ggplot') applies a different visual style
pctdistance controls where percentages are positioned
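The five weight groups above are built with five separate filters. As an alternative sketch (not the approach used in this chapter), pd.cut() can assign every value to a bin in a single step. The weights below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up weights in lbs, including values that sit exactly on bin edges
weights = pd.Series([120, 124, 130, 149, 150, 160, 174, 175, 199, 200, 210])

bins = [0, 125, 150, 175, 200, np.inf]
labels = ['under 125', '125-150', '150-175', '175-200', 'over 200']

# right=False makes each bin [low, high), matching the >= / < filters above
groups = pd.cut(weights, bins=bins, labels=labels, right=False)

# Count per group, reported in the order of the labels list
counts = groups.value_counts().reindex(labels)
print(counts)
```

The resulting counts can be passed to plt.pie() together with labels, exactly as the weights list is used above.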

11.5.9.5 Box Plot: Distribution Summary

Use Case: Comparing distributions across groups, identifying outliers

Function: plt.boxplot(data_list, tick_labels=labels)

plt.figure(figsize=(5,8), dpi=100)

plt.style.use('default')

barcelona = fifa.loc[fifa.Club == "FC Barcelona"]['Overall']
madrid = fifa.loc[fifa.Club == "Real Madrid"]['Overall']
revs = fifa.loc[fifa.Club == "New England Revolution"]['Overall']

# tick_labels requires Matplotlib 3.9+; on older versions use labels= instead
bp = plt.boxplot([barcelona, madrid, revs], tick_labels=['FC Barcelona','Real Madrid','NE Revolution'], patch_artist=True, medianprops={'linewidth': 2})

plt.title('Professional Soccer Team Comparison')
plt.ylabel('FIFA Overall Rating')

for box in bp['boxes']:
    # change outline color
    box.set(color='#4286f4', linewidth=2)
    # change fill color
    box.set(facecolor = '#e0e0e0' )
    # change hatch
    #box.set(hatch = '/')

Box Plot Elements:

Box = 25th to 75th percentile (middle 50% of data)
Orange line = Median
Whiskers = Extend to the most extreme data points within 1.5 × IQR of the box edges
Circles = Outliers

Observation: FC Barcelona and Real Madrid have higher overall player ratings compared to New England Revolution.

11.5.9.6 Strengths of Matplotlib pyplot

✅ Maximum Control - Customize every element
✅ Publication Quality - Professional output suitable for papers
✅ Extensive Plot Types - Line, scatter, bar, histogram, pie, box, and many more
✅ Well Documented - Large community and comprehensive documentation
✅ Foundation - Base for Pandas, Seaborn, and other libraries

11.5.9.7 Essential Best Practices

Set Figure Size Early

plt.figure(figsize=(width, height))  # Before plt.plot()

Always Label Your Axes

plt.xlabel('X-axis label (with units!)')
plt.ylabel('Y-axis label (with units!)')

Add Descriptive Titles

plt.title('Clear description of what the plot shows')

Include Legends for Multiple Lines
```
plt.legend(['Series 1', 'Series 2'])
```

Use Semicolons in Jupyter

plt.plot(x, y);  # Suppresses unwanted text output

Save High-Quality Figures

plt.savefig('filename.png', dpi=300, bbox_inches='tight')
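The pixel size of a saved image is the figure size in inches multiplied by the dpi, so a 6 × 4 inch figure saved at 300 dpi should come out at 1800 × 1200 pixels. The sketch below checks this by saving to an in-memory buffer instead of a file (an assumption made to keep the example self-contained) and reading the image back.

```python
import io
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))     # width, height in inches
plt.plot([0, 1, 2], [0, 1, 4])

# Save to memory instead of disk; bbox_inches='tight' is left out on purpose,
# because cropping the margins would change the final pixel size
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=300)
buffer.seek(0)

# Read the PNG back as an array of shape (height, width, channels)
image = plt.imread(buffer, format='png')
height_px, width_px = image.shape[:2]
print(width_px, height_px)
```

With bbox_inches='tight' the image is cropped to the plot's contents, so expect somewhat different dimensions.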

11.5.9.8 When to Use pyplot

Best For:

Creating custom visualizations with specific requirements
Building complex multi-panel layouts (covered in next chapter)
Fine-tuning every visual element
Generating publication-ready figures
When you need maximum control

11.5.10 What’s Next?

You’ve mastered the pyplot interface - the state-based, MATLAB-style approach to plotting.

In the next chapter, we’ll explore Matplotlib’s Object-Oriented Interface, which gives you even more control and is better for:

Creating complex multi-panel figures
Writing reusable plotting functions
Managing multiple figures simultaneously
Professional data visualization workflows

Preview of OO Interface:

# pyplot style (what we just learned)
plt.plot(x, y)
plt.xlabel('X Label')

# Object-oriented style (next chapter)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('X Label')

The OO interface is the professional standard for complex visualizations, but everything you learned about styling, colors, and plot types applies equally to both interfaces!
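As a small taste of why the OO style suits reusable code, here is a sketch of a helper that draws onto whichever Axes it is given. The function name plot_yield and the data values are hypothetical and are not part of Matplotlib or of this chapter's dataset.

```python
import matplotlib.pyplot as plt

def plot_yield(ax, years, values, label, fmt='o-'):
    """Draw one yield series on the given Axes and label the axes."""
    ax.plot(years, values, fmt, label=label)
    ax.set_xlabel('Year')
    ax.set_ylabel('Yield (tons per hectare)')
    ax.legend()
    return ax

# Illustrative values only
years = [2010, 2011, 2012, 2013]
apples = [0.90, 0.91, 0.93, 0.97]
oranges = [0.96, 0.94, 0.95, 0.93]

# One figure with two side-by-side panels; each panel is its own Axes object
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
plot_yield(ax_left, years, apples, 'Apples', 's-b')
plot_yield(ax_right, years, oranges, 'Oranges', 'o--r')
fig.suptitle('Crop Yields in Kanto')
```

Because the function receives the Axes explicitly, it never depends on which plot happens to be current, which is the main weakness of pyplot-style calls in larger programs.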

11.6 Plotting with Seaborn

Seaborn is a powerful data visualization library built on top of Matplotlib, designed to make statistical plots easier and more attractive. It provides:

High-level interface for drawing attractive statistical graphics
Beautiful default styles that are publication-ready out of the box
Seamless Pandas integration with direct DataFrame support
Built-in statistical computations (confidence intervals, regression lines, etc.)
Specialized plot types for statistical analysis

Think of it this way:

┌─────────────────────────────────────┐
│     Pandas .plot()                  │  ← Quickest, least control
├─────────────────────────────────────┤
│     Seaborn                         │  ← Beautiful + Statistical
├─────────────────────────────────────┤
│     Matplotlib                      │  ← Most control, most verbose
└─────────────────────────────────────┘

11.6.1 Step 1: Why Choose Seaborn Over Matplotlib?

Seaborn addresses several pain points of using Matplotlib directly:

11.6.1.1 Problem 1: Default Aesthetics

Matplotlib Challenge:
- Default styles can look dated
- Requires manual styling for a professional appearance
- Grid, colors, and fonts need explicit configuration

Seaborn Solution:
- Modern, publication-ready styles out of the box
- Professional appearance with zero configuration
- Multiple preset themes for different contexts

11.6.1.2 Problem 2: Statistical Visualizations

Matplotlib Challenge:
- Requires extensive code for statistical plots
- Manual calculation of confidence intervals
- Complex code for regression lines and error bars

Seaborn Solution:
- Built-in statistical functions
- Automatic confidence interval computation
- One-line regression plots with regplot()

11.6.1.3 Problem 3: DataFrame Integration

Matplotlib Challenge:
- Works primarily with arrays and lists
- Requires extracting columns manually
- No automatic label generation from column names

Seaborn Solution:
- Native DataFrame support via the data parameter
- Column names automatically become labels
- Natural, intuitive syntax: x='column_name'

11.6.1.4 Problem 4: Complex Multi-Panel Plots

Matplotlib Challenge:
- Subplots require significant boilerplate code
- Manual management of figure and axes objects
- Difficult to create consistent multi-plot layouts

Seaborn Solution:
- FacetGrid for automatic multi-panel layouts
- pairplot() for pairwise relationship matrices
- Simplified API for complex visualizations

11.6.2 Step 2: Setup and Aesthetic Customization

Before creating plots, let’s learn how to configure Seaborn’s appearance.

First, let’s import Seaborn and explore its built-in datasets:

Seaborn Datasets:

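Seaborn provides access to a collection of example datasets (22 in the listing below) that are well suited to learning and practice. They are downloaded from an online repository the first time you load them, so an internet connection is needed at that point. This lets you focus on visualization techniques without spending time finding and cleaning data.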

import seaborn as sns
# List the names of the available example datasets
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

Great! Now you can use any of these datasets with sns.load_dataset('dataset_name').

11.6.2.1 Customizing Plot Aesthetics

Seaborn provides powerful functions to customize the visual appearance of plots:

Style Options:

Seaborn offers five built-in styles via sns.set_style():

Style	Description	Best For
`"whitegrid"`	White background with grid lines	Most presentations and reports (recommended)
`"darkgrid"`	Dark background with grid lines	Data with many points or intricate details
`"white"`	Simple white background, no grid	Minimalist aesthetic, emphasis on data
`"dark"`	Dark background without grid	Emphasizing data points, reducing distractions
`"ticks"`	White background with axis ticks	Adding detail for precise reference

Let’s apply a style:

sns.set_style("whitegrid")

Perfect! Now all our Seaborn plots will have a clean, professional appearance with white background and gridlines.
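If you are curious what a style actually changes, sns.axes_style() returns the underlying settings as a dictionary without applying them. It can also be used in a with block to apply a style to a single plot only. The sketch below shows both uses with throwaway data.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# axes_style() returns the settings a style would apply, without applying them
whitegrid = sns.axes_style("whitegrid")
white = sns.axes_style("white")
print(whitegrid["axes.grid"], white["axes.grid"])   # grid on vs. grid off

# As a context manager, the style applies only to plots created inside the block
with sns.axes_style("ticks"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])
```

The temporary form is handy when one figure in a notebook needs a different look from the rest.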

11.6.3 Step 3: Distribution Plots

Distribution plots help you understand how your data is spread out. Let’s explore using the famous Iris flower dataset.

Load the Dataset:

# Load data into a Pandas dataframe
flowers_df = sns.load_dataset("iris")
flowers_df.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Dataset: The Iris dataset contains measurements of 150 iris flowers from three different species. Let’s visualize the relationships between features.

11.6.4 Step 4: Relational Plots - Scatter Plots

Scatter plots show relationships between two numerical variables. Seaborn’s scatterplot() makes this easy.

11.6.4.1 Basic Scatter Plot

sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);

Observation: There is no single overall trend, but the points fall into visibly separate clusters. This suggests that the data contains distinct groups.

11.6.4.2 Adding a Third Dimension with Hue

The real power of Seaborn comes from easily adding additional dimensions to your plots. The hue parameter colors points based on a categorical variable.

First, let’s check what species we have:

flowers_df.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width, hue=flowers_df.species, s=100);

Much more informative! Now we can clearly see:

Setosa flowers have shorter but wider sepals
Virginica flowers show the opposite pattern (longer, narrower sepals)
Versicolor falls in between
Setosa is clearly separated from the other two species, while Versicolor and Virginica overlap considerably on these two measurements

Key Insight: The hue parameter is one of Seaborn’s most powerful features - it adds a third dimension to 2D plots with automatic color coding and legend generation.
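To appreciate what hue automates, here is a rough Matplotlib-only equivalent: split the data by category, draw each group separately, and build the legend from the group labels. The small DataFrame holds just a handful of Iris-style rows so that the sketch runs without downloading the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few Iris-style rows (illustrative subset, two per species)
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

fig, ax = plt.subplots()

# One scatter call per group; Matplotlib picks a new color for each call
for name, group in df.groupby('species'):
    ax.scatter(group['sepal_length'], group['sepal_width'], s=100, label=name)

ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.legend(title='species')

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

Seaborn performs the grouping, coloring, axis labeling and legend creation in a single call.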

11.6.4.3 Integrating Seaborn with Matplotlib

Since Seaborn is built on Matplotlib, you can use Matplotlib functions to customize Seaborn plots further:

plt.figure(figsize=(12, 6))
plt.title('Sepal Dimensions')

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=100);

Perfect combination! Seaborn creates beautiful plots quickly, and Matplotlib adds fine-tuning.

11.6.4.4 DataFrame Integration - The Recommended Approach

Instead of passing Series objects, Seaborn works best with the data parameter and column names:

plt.title('Sepal Dimensions')
sns.scatterplot(x='sepal_length', 
                y='sepal_width', 
                hue='species',
                s=100,
                data=flowers_df);

Benefits:

Cleaner, more readable code
Easier to modify (just change column names)
Consistent with Seaborn’s design philosophy

11.6.5 Step 5: Distribution Plots - Histograms

Histograms show the frequency distribution of a single variable. Let’s visualize sepal width distribution.

11.6.5.1 Basic Histogram

sns.histplot(data=flowers_df, x='sepal_width');

Observation: The distribution appears roughly bell-shaped (normal), with most values around 3.0.

11.6.5.2 Adding KDE (Kernel Density Estimate)

KDE creates a smooth curve showing the distribution’s shape:


sns.histplot(data=flowers_df, x='sepal_width', kde=True);

Enhanced! The smooth KDE curve helps visualize the overall distribution shape.
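Under the hood, a KDE places a small bell curve on every observation and averages them. The sketch below implements that idea with NumPy on synthetic data. The bandwidth uses Scott's rule, which is an assumption here: Seaborn's bandwidth choice and its scaling of the curve to the histogram's axis may differ in detail.

```python
import numpy as np

# Synthetic stand-in for the sepal_width column
rng = np.random.default_rng(1)
widths = rng.normal(loc=3.0, scale=0.4, size=150)

n = widths.size
bandwidth = widths.std(ddof=1) * n ** (-1 / 5)   # Scott's rule for 1-D data

# Points at which the density is evaluated, padded beyond the data range
grid = np.linspace(widths.min() - 3 * bandwidth,
                   widths.max() + 3 * bandwidth, 400)

# One Gaussian bump per observation, averaged over all observations
z = (grid[:, None] - widths[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 (simple Riemann sum)
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that hides detail, and a smaller one follows the individual observations more closely.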

11.6.5.3 Comparing Distributions with Hue

Use hue to compare distributions across categories:

# adding hue
sns.histplot(data=flowers_df, x="sepal_width", hue="species");

Insight: Different species have different sepal width distributions:

Setosa tends to have wider sepals (peak around 3.4)
Versicolor and Virginica have narrower sepals (peaks around 2.8-3.0)

11.6.6 Step 6: Categorical Plots - Bar Plots

Bar plots show aggregate statistics (like mean) for different categories. Let’s use the Tips dataset.

Load Tips Dataset:

Dataset: Tips received by restaurant servers, including day, time, party size, and whether the customer smoked.

11.6.6.1 Basic Bar Plot

By default, barplot() shows the mean value with confidence interval error bars:

tips_df = sns.load_dataset("tips")
sns.barplot(x='day', y='total_bill', data=tips_df);

Observation: Weekend (Saturday/Sunday) bills tend to be higher on average than weekday bills.

Note: The black lines are 95% confidence intervals for the mean, which Seaborn estimates by bootstrapping.
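The idea behind such an error bar can be sketched with a bootstrap: resample the data with replacement many times, compute the mean of each resample, and take the middle 95% of those means. The bills below are synthetic, and the 1000 resamples and 95% level are assumed to mirror Seaborn's defaults.

```python
import numpy as np

# Synthetic bills for a single day (illustrative only)
rng = np.random.default_rng(42)
bills = rng.gamma(shape=4.0, scale=5.0, size=80)

# Resample with replacement and record the mean of each resample
n_boot = 1000
boot_means = np.array([
    rng.choice(bills, size=bills.size, replace=True).mean()
    for _ in range(n_boot)
])

mean = bills.mean()                                   # bar height
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # error bar ends

print(round(mean, 2), round(ci_low, 2), round(ci_high, 2))
```

Smaller groups give wider intervals, which is why a bar based on few observations carries a longer error bar.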

11.6.6.2 Adding Hue for Comparison

Compare tips by gender across different days:

sns.barplot(x='day', y='tip', hue='sex', data=tips_df);

Insight: Males tend to give slightly higher tips across most days.

11.6.6.3 Horizontal Bar Charts

Simply swap the x and y parameters to make bars horizontal (useful for long category names):

# make the bars horizontal simply by switching the axes
sns.barplot(x='tip', y='day', hue='sex', data=tips_df);

Horizontal bars are easier to read when you have many categories or long labels.

11.6.7 Step 7: Box Plots - Distribution Summary

Box plots provide a statistical summary of distributions and are excellent for comparing groups and identifying outliers.

11.6.7.1 Understanding Box Plots

Box plots summarize a distribution with a few key statistics:

Box Plot Elements:
- Box = 25th to 75th percentile (the interquartile range, IQR, covering the middle 50% of the data)
- Line inside box = Median (50th percentile)
- Whiskers = Extend to the most extreme data points within 1.5 × IQR of the box edges
- Individual points = Outliers (values beyond the whiskers)
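These elements can be computed by hand, which makes the 1.5 × IQR rule concrete. The sketch below uses nine made-up values and NumPy's default percentile interpolation; plotting libraries follow the same rule, though their percentile method may differ slightly.

```python
import numpy as np

# Nine made-up values with one low and one high extreme
bills = np.array([50, 60, 62, 64, 65, 66, 68, 70, 95])

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = bills[(bills >= lower_fence) & (bills <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Note that the whiskers end at actual data points (60 and 70 here), not at the fences themselves.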

11.6.7.2 Creating a Box Plot

Compare total bills across days, separated by gender:

sns.boxplot(data=tips_df, y='total_bill', x='day', hue='sex');

What can we observe from this box plot?

Key Insights from the Box Plot:

Weekend vs Weekday Patterns:
- Saturday and Sunday show higher median total bills (the line inside each box sits higher)
- Thursday has the lowest median bills
Gender Differences:
- Males generally have higher median total bills than females across all days
- The difference is most pronounced on weekends
Variability:
- Saturday shows the highest variability (tallest box and longest whiskers)
- Thursday shows the most consistent bills (shortest box)
Outliers:
- Several outlier points visible on most days (individual dots above whiskers)
- These represent unusually large bills
- Outliers are more common on weekends
Distribution Shape:
- Most distributions are slightly right-skewed (median closer to bottom of box)
- This suggests occasional very high bills pull the mean up

11.6.8 Step 8: Matrix Visualizations - Heatmaps

Heatmaps are excellent for visualizing 2D data matrices, especially for showing patterns in tabular data.

11.6.8.1 Understanding Heatmaps

Heatmaps represent 2-dimensional data (like a matrix or table) using colors. Each cell's color encodes its value, and the colormap determines whether darker or brighter shades stand for higher values.

Use Case: Let’s visualize airline passenger data to see patterns over time.

Dataset Structure: Rows represent months, columns represent years, values show passenger counts (in thousands).

Note: The pivot() function restructures data for heatmap visualization. You’ll learn more about pivoting in later chapters.
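As a quick preview, the sketch below shows what pivot() does to a tiny long-format table: one column supplies the row labels, one supplies the column labels, and one supplies the cell values. The four rows are a hand-typed excerpt in the style of the flights data.

```python
import pandas as pd

# Long format: one row per (year, month) observation
long_df = pd.DataFrame({
    'year':  [1949, 1949, 1950, 1950],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'passengers': [112, 118, 115, 126],
})

# Wide format: months as rows, years as columns, passengers as cell values
wide_df = long_df.pivot(index='month', columns='year', values='passengers')
print(wide_df)
```

With plain strings the months are sorted alphabetically, so Feb comes before Jan here. In the Seaborn dataset the month column is categorical, which is why its rows keep calendar order.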

11.6.8.2 Creating a Basic Heatmap

flights_df = sns.load_dataset("flights").pivot(index="month", columns="year", values="passengers")
flights_df
# you will learn pivot in the later chapters

year	1949	1950	1951	1952	1953	1954	1955	1956	1957	1958	1959	1960
month
Jan	112	115	145	171	196	204	242	284	315	340	360	417
Feb	118	126	150	180	196	188	233	277	301	318	342	391
Mar	132	141	178	193	236	235	267	317	356	362	406	419
Apr	129	135	163	181	235	227	269	313	348	348	396	461
May	121	125	172	183	229	234	270	318	355	363	420	472
Jun	135	149	178	218	243	264	315	374	422	435	472	535
Jul	148	170	199	230	264	302	364	413	465	491	548	622
Aug	148	170	199	242	272	293	347	405	467	505	559	606
Sep	136	158	184	209	237	259	312	355	404	404	463	508
Oct	119	133	162	191	211	229	274	306	347	359	407	461
Nov	104	114	146	172	180	203	237	271	305	310	362	390
Dec	118	140	166	194	201	229	278	306	336	337	405	432

flights_df is a matrix with one row for each month and one column for each year. The values show the number of airline passengers (in thousands) recorded in a specific month of a year. We can use the sns.heatmap function to visualize how passenger traffic varies.

plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df);

Reading the Heatmap:

Brighter colors indicate higher passenger counts. From this visualization we can see:

Key Patterns Observed:

Seasonal Pattern (Vertical):
- Passenger traffic is highest in July & August (summer months) - brightest colors
- Lowest in winter months (November-February) - darker colors
Growth Trend (Horizontal):
- Passenger numbers increase year over year
- Each year shows generally brighter colors than the previous

11.6.8.3 Enhanced Heatmap with Annotations

Add actual numbers and customize colors for clarity:

# fmt="d" formats each annotation as a base-10 integer
plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df, fmt="d", annot=True, cmap='Blues');

Enhancements:

annot=True displays actual values in each cell
fmt="d" formats numbers as integers (no decimals)
cmap='Blues' uses a blue color scheme (darker = more passengers)

Much clearer! Now we can see exact passenger counts while still benefiting from the color visualization.

11.6.9 Step 9: Correlation Matrices

One of the most powerful uses of heatmaps is visualizing correlation matrices.

11.6.9.1 What is a Correlation Matrix?

A correlation matrix is a square table, often displayed as a heatmap, where:

Values represent correlation coefficients between pairs of variables
Range from -1 (perfect negative correlation) to +1 (perfect positive correlation)
Shows strength and direction of linear relationships

Visual Guide (for a diverging colormap such as 'coolwarm'; Seaborn's default heatmap colormap is sequential):

Dark red/positive = Variables increase together
Dark blue/negative = One increases as other decreases
Light colors near 0 = Little to no linear relationship

Important Note: In this course, correlation always refers to Pearson’s correlation coefficient, which measures linear association between variables.

Critical Caveat: Correlation does NOT imply causation! A strong correlation means variables move together, but doesn’t tell us if one causes the other.
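Pearson's coefficient for two variables is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below computes it by hand on made-up numbers, compares the result with Pandas, and then shows a perfectly predictable but non-linear relationship whose correlation is zero.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])     # roughly linear in x

def pearson(a, b):
    """Pearson correlation computed directly from the definition."""
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

r_manual = pearson(x, y)
r_pandas = pd.Series(x).corr(pd.Series(y))   # Pearson is the default method

# A U-shaped relationship: y depends entirely on x, yet r is zero
y_curve = (x - 3) ** 2
r_curve = pearson(x, y_curve)

print(round(r_manual, 4), round(r_pandas, 4), round(r_curve, 4))
```

The second result is a reminder that a correlation near zero rules out only a linear relationship, not every relationship.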

11.6.9.2 Computing Correlations with Pandas

Pandas provides built-in functions for correlation analysis:

Pandas Correlation Functions:

.corr() - Computes pairwise correlation between all numeric columns of a DataFrame
.corrwith() - Computes correlation of DataFrame columns with another DataFrame or Series

Let’s load a survey dataset to explore correlations:

# Load the cleaned survey data
survey_data = pd.read_csv('./Datasets/survey_data_clean.csv')

survey_data.head()

	Timestamp	fav_alcohol	parties_per_month	smoke	weed	introvert_extrovert	love_first_sight	learning_style	left_right_brained	personality_type	...	used_python_before	dominant_hand	childhood_in_US	gender	region_of_residence	political_affliation	cant_change_math_ability	can_change_math_ability
0	2022/09/13 1:43:34 pm GMT-5	I don't drink	1.0	No	Occasionally	Introvert	0	Visual (learn best through images or graphic o...	Left-brained (logic, science, critical thinkin...	INFJ	...	1	Right	1	Female	Northeast	Democrat	0	1
1	2022/09/13 5:28:17 pm GMT-5	Hard liquor/Mixed drink	3.0	No	Occasionally	Extrovert	0	Visual (learn best through images or graphic o...	Left-brained (logic, science, critical thinkin...	ESFJ	...	1	Right	1	Male	West	Democrat	0	1
2	2022/09/13 7:56:38 pm GMT-5	Hard liquor/Mixed drink	3.0	No	Yes	Introvert	0	Kinesthetic (learn best through figuring out h...	Left-brained (logic, science, critical thinkin...	ISTJ	...	0	Right	0	Female	International	No affiliation	0	1
3	2022/09/13 10:34:37 pm GMT-5	Hard liquor/Mixed drink	12.0	No	No	Extrovert	0	Visual (learn best through images or graphic o...	Left-brained (logic, science, critical thinkin...	ENFJ	...	0	Right	1	Female	Southeast	Democrat	0	1
4	2022/09/14 4:46:19 pm GMT-5	I don't drink	1.0	No	No	Extrovert	1	Reading/Writing (learn best through words ofte...	Right-brained (creative, art, imaginative, int...	ENTJ	...	0	Right	1	Female	Northeast	Democrat	1	0

5 rows × 51 columns

Compute the pairwise correlation matrix:

#Pairwise correlation amongst all columns
survey_data.select_dtypes(include='number').corr()

	parties_per_month	love_first_sight	num_insta_followers	expected_marriage_age	expected_starting_salary	minutes_ex_per_week	sleep_hours_per_day	farthest_distance_travelled	fav_number	internet_hours_per_day	...	procrastinator	num_clubs	student_athlete	AP_stats	used_python_before	childhood_in_US	cant_change_math_ability	can_change_math_ability	math_is_genetic	much_effort_is_lack_of_talent
parties_per_month	1.000000	0.096129	0.239705	-0.064079	0.114881	0.195561	-0.052542	-0.017081	-0.050139	0.087390	...	-0.056871	-0.010514	0.290830	-0.013222	-0.040033	0.081905	-0.052912	0.055575	-0.013374	-0.029838
love_first_sight	0.096129	1.000000	-0.024010	-0.084406	0.080138	0.099244	-0.025378	-0.075539	0.105095	-0.007652	...	0.033951	0.083342	0.014595	-0.062992	-0.034692	-0.118260	0.005254	0.020758	-0.003710	0.013376
num_insta_followers	0.239705	-0.024010	1.000000	-0.130157	0.127226	0.099341	-0.042421	0.011308	-0.124763	-0.028427	...	-0.089871	0.265958	0.044807	0.005947	-0.016201	0.072622	-0.150658	0.130774	-0.018411	-0.165899
expected_marriage_age	-0.064079	-0.084406	-0.130157	1.000000	-0.014881	-0.088073	0.182009	-0.024038	-0.008924	-0.029772	...	-0.020012	-0.137069	-0.036122	0.010447	0.052727	0.053759	-0.072163	0.087633	-0.086898	0.052813
expected_starting_salary	0.114881	0.080138	0.127226	-0.014881	1.000000	0.134065	-0.005078	-0.028329	-0.028125	0.017479	...	0.054273	-0.100922	-0.026219	-0.084894	-0.094541	0.081142	-0.011609	0.019171	0.078694	0.097265
minutes_ex_per_week	0.195561	0.099244	0.099341	-0.088073	0.134065	1.000000	0.049593	-0.153188	0.038758	-0.028457	...	-0.045149	-0.024572	0.576301	-0.062544	0.057760	0.235492	-0.101282	0.134430	-0.047772	-0.045141
sleep_hours_per_day	-0.052542	-0.025378	-0.042421	0.182009	-0.005078	0.049593	1.000000	0.104175	-0.021909	0.017435	...	-0.176579	-0.163860	0.058361	-0.013909	0.096528	-0.059468	-0.058086	0.012174	0.027052	-0.022025
farthest_distance_travelled	-0.017081	-0.075539	0.011308	-0.024038	-0.028329	-0.153188	0.104175	1.000000	-0.108661	0.049450	...	0.032492	-0.045214	-0.158027	0.010580	0.012353	-0.282821	-0.046074	0.017935	0.110037	0.046895
fav_number	-0.050139	0.105095	-0.124763	-0.008924	-0.028125	0.038758	-0.021909	-0.108661	1.000000	-0.013070	...	0.085508	-0.013696	-0.014435	0.091011	0.030736	0.072894	-0.032534	0.034319	-0.063692	-0.073777
internet_hours_per_day	0.087390	-0.007652	-0.028427	-0.029772	0.017479	-0.028457	0.017435	0.049450	-0.013070	1.000000	...	0.048239	0.064527	-0.017944	0.001818	0.051970	0.033120	-0.033902	0.050258	0.190205	-0.053708
only_child	-0.142519	0.124345	-0.152184	-0.043141	-0.088648	-0.123371	0.038126	0.214377	-0.024419	-0.035022	...	0.073415	-0.065484	0.064136	0.048031	-0.139898	-0.387711	0.023089	-0.019982	0.058226	0.092372
num_majors_minors	-0.073127	0.108730	0.050431	-0.055280	0.021278	0.044450	-0.024339	-0.012779	0.023903	-0.073775	...	-0.073806	0.311266	-0.035500	-0.068640	-0.073388	-0.153529	-0.077501	0.024734	-0.125809	-0.064939
high_school_GPA	0.295646	0.069288	0.147402	0.017052	0.053354	-0.076471	-0.036904	-0.064116	-0.023081	-0.034485	...	0.031561	-0.020854	0.006332	0.066837	0.072777	0.005606	-0.095025	0.093416	-0.082620	0.001373
NU_GPA	-0.080548	-0.114041	0.004702	0.011925	-0.048069	-0.108177	0.143997	0.038238	-0.307656	-0.014531	...	-0.269552	0.016724	-0.027378	-0.026544	-0.008536	-0.028968	0.002094	-0.137330	0.036731	0.047840
age	-0.032771	0.142384	-0.230698	0.060416	-0.102632	-0.040906	-0.035890	0.018811	0.096818	0.017515	...	-0.005892	-0.127760	-0.038315	-0.026959	0.009924	-0.152784	-0.005954	0.014759	-0.009315	-0.126370
height	-0.005405	0.216072	0.009318	0.044577	0.151517	0.182090	-0.010650	-0.235067	0.041298	-0.023174	...	0.063263	0.212038	0.080953	0.022484	0.016110	0.160309	-0.055641	0.101811	-0.064383	0.028509
height_father	0.126741	0.029419	0.179684	0.026949	0.011450	0.156227	0.097593	-0.118669	-0.032717	-0.047314	...	-0.111183	0.022701	0.155003	-0.010982	-0.006480	0.137934	-0.019593	0.008157	0.010222	0.060439
height_mother	0.079121	0.082684	0.129716	0.075316	0.033947	0.114181	-0.044089	-0.134582	-0.029568	-0.091417	...	-0.078265	0.091390	0.053258	-0.100647	-0.021396	0.119292	0.027120	0.034961	-0.035449	0.074492
procrastinator	-0.056871	0.033951	-0.089871	-0.020012	0.054273	-0.045149	-0.176579	0.032492	0.085508	0.048239	...	1.000000	0.078341	0.094363	0.003053	-0.016254	-0.090868	0.002462	0.084419	-0.001738	0.081471
num_clubs	-0.010514	0.083342	0.265958	-0.137069	-0.100922	-0.024572	-0.163860	-0.045214	-0.013696	0.064527	...	0.078341	1.000000	-0.084562	0.087438	0.115062	-0.021044	-0.136249	0.070002	-0.090570	-0.108851
student_athlete	0.290830	0.014595	0.044807	-0.036122	-0.026219	0.576301	0.058361	-0.158027	-0.014435	-0.017944	...	0.094363	-0.084562	1.000000	-0.040686	-0.049288	0.082888	-0.066667	-0.022576	-0.060523	0.121232
AP_stats	-0.013222	-0.062992	0.005947	0.010447	-0.084894	-0.062544	-0.013909	0.010580	0.091011	0.001818	...	0.003053	0.087438	-0.040686	1.000000	0.089517	0.106584	0.081109	0.029743	-0.048375	-0.018043
used_python_before	-0.040033	-0.034692	-0.016201	0.052727	-0.094541	0.057760	0.096528	0.012353	0.030736	0.051970	...	-0.016254	0.115062	-0.049288	0.089517	1.000000	0.041928	-0.011217	0.156806	0.088566	0.023366
childhood_in_US	0.081905	-0.118260	0.072622	0.053759	0.081142	0.235492	-0.059468	-0.282821	0.072894	0.033120	...	-0.090868	-0.021044	0.082888	0.106584	0.041928	1.000000	-0.008575	0.057185	-0.178003	-0.013098
cant_change_math_ability	-0.052912	0.005254	-0.150658	-0.072163	-0.011609	-0.101282	-0.058086	-0.046074	-0.032534	-0.033902	...	0.002462	-0.136249	-0.066667	0.081109	-0.011217	-0.008575	1.000000	-0.672777	0.294544	0.101835
can_change_math_ability	0.055575	0.020758	0.130774	0.087633	0.019171	0.134430	0.012174	0.017935	0.034319	0.050258	...	0.084419	0.070002	-0.022576	0.029743	0.156806	0.057185	-0.672777	1.000000	-0.361546	-0.131047
math_is_genetic	-0.013374	-0.003710	-0.018411	-0.086898	0.078694	-0.047772	0.027052	0.110037	-0.063692	0.190205	...	-0.001738	-0.090570	-0.060523	-0.048375	0.088566	-0.178003	0.294544	-0.361546	1.000000	0.154083
much_effort_is_lack_of_talent	-0.029838	0.013376	-0.165899	0.052813	0.097265	-0.045141	-0.022025	0.046895	-0.073777	-0.053708	...	0.081471	-0.108851	0.121232	-0.018043	0.023366	-0.013098	0.101835	-0.131047	0.154083	1.000000

28 rows × 28 columns

The matrix is hard to read! Let’s find which features correlate most with NU_GPA:

survey_data.select_dtypes(include='number').corrwith(survey_data.NU_GPA).sort_values(ascending = False)

NU_GPA                           1.000000
sleep_hours_per_day              0.143997
num_majors_minors                0.141988
only_child                       0.106440
much_effort_is_lack_of_talent    0.047840
farthest_distance_travelled      0.038238
math_is_genetic                  0.036731
num_clubs                        0.016724
expected_marriage_age            0.011925
num_insta_followers              0.004702
cant_change_math_ability         0.002094
used_python_before              -0.008536
internet_hours_per_day          -0.014531
AP_stats                        -0.026544
student_athlete                 -0.027378
childhood_in_US                 -0.028968
high_school_GPA                 -0.030883
height_father                   -0.040120
expected_starting_salary        -0.048069
age                             -0.052039
height_mother                   -0.079276
parties_per_month               -0.080548
height                          -0.099082
minutes_ex_per_week             -0.108177
love_first_sight                -0.114041
can_change_math_ability         -0.137330
procrastinator                  -0.269552
fav_number                      -0.307656
dtype: float64

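Better! Now we can see the correlations with NU_GPA sorted from most positive to most negative. Note that this is not the same as strongest to weakest: the strongest negative correlations sit at the bottom of the list.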
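To rank features by strength regardless of sign, one option is to sort by absolute value. The sketch below uses a small synthetic DataFrame as a stand-in for survey_data; the column names and coefficients are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for survey_data (synthetic values, not the real survey)
rng = np.random.default_rng(0)
n = 200
gpa = rng.normal(3.5, 0.3, n)
toy = pd.DataFrame({
    "NU_GPA": gpa,
    "procrastinator": -1.0 * gpa + rng.normal(0, 0.3, n),  # built to be negatively related
    "sleep_hours": 0.2 * gpa + rng.normal(0, 0.5, n),      # built to be weakly positive
    "noise": rng.normal(0, 1, n),                          # unrelated to GPA
})

# Correlation of every column with NU_GPA, dropping the trivial self-correlation
corr_with_gpa = toy.corrwith(toy["NU_GPA"]).drop("NU_GPA")

# Sort by magnitude so strong negative correlations are not buried at the bottom
ranked = corr_with_gpa.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

The sign is preserved in the output, so you still see the direction of each relationship while reading the list from strongest to weakest.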

11.6.9.3 Visualizing the Correlation Matrix

Now let’s create a heatmap to see all correlations at once:

sns.set(rc={'figure.figsize':(12,10)})
sns.heatmap(survey_data.select_dtypes(include='number').corr());

Key Findings from the Correlation Heatmap:

Strong Positive Correlation:
- student_athlete is strongly positively correlated with minutes_ex_per_week
- This makes sense: student athletes exercise more
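Negative Correlation:
- procrastinator is negatively correlated with NU_GPA (r ≈ -0.27), one of the largest negative correlations with NU_GPA in the sorted list above
- Students who procrastinate tend to have somewhat lower GPAs, although a coefficient of this size indicates a weak relationship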
Diagonal:
- All 1.0 values (variables perfectly correlate with themselves)
Symmetry:
- Matrix is symmetric: correlation(A, B) = correlation(B, A)
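The default heatmap can be made easier to read with a few optional arguments. The following is a minimal sketch on synthetic data (the DataFrame `toy` and its columns are hypothetical): it fixes the color scale to the full correlation range, centers a diverging palette at zero, annotates each cell, and hides the redundant upper triangle.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: column "b" is constructed to correlate with "a"
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
toy["b"] = toy["a"] + rng.normal(scale=0.5, size=100)

corr = toy.corr()

# The matrix is symmetric, so mask everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    mask=mask,
    annot=True, fmt=".2f",      # print each coefficient with two decimals
    cmap="vlag",                # diverging palette
    vmin=-1, vmax=1, center=0,  # zero correlation maps to the palette midpoint
    ax=ax,
)
ax.set_title("Correlation matrix (toy data)")
```

With 28 variables, as in the survey data, annotations may become too dense to read; in that case dropping `annot=True` and keeping the fixed color scale is a reasonable compromise.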

11.6.10 Interpreting Correlation Coefficients

Understanding what correlation values mean:

Coefficient Range	Interpretation	Relationship Strength
0.9 to 1.0	Very strong positive	Nearly perfect linear relationship
0.7 to 0.9	Strong positive	Clear linear pattern
0.5 to 0.7	Moderate positive	Noticeable but imperfect pattern
0.3 to 0.5	Weak positive	Slight tendency
-0.3 to 0.3	Little to none	Little or no linear relationship
-0.5 to -0.3	Weak negative	Slight inverse tendency
-0.7 to -0.5	Moderate negative	Noticeable inverse pattern
-0.9 to -0.7	Strong negative	Clear inverse linear pattern
-1.0 to -0.9	Very strong negative	Nearly perfect inverse relationship
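The table can be turned into a small helper for labelling coefficients. This is only an illustrative sketch: the function name is hypothetical, and because the ranges in the table share their endpoints, the code assumes that a boundary value (for example exactly 0.5) belongs to the stronger category.

```python
def describe_correlation(r: float) -> str:
    """Return the table's interpretation for a Pearson coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")

    strength = abs(r)
    if strength < 0.3:
        return "little to none"

    # Assumption: shared boundaries (0.3, 0.5, 0.7, 0.9) go to the stronger band
    if strength < 0.5:
        label = "weak"
    elif strength < 0.7:
        label = "moderate"
    elif strength < 0.9:
        label = "strong"
    else:
        label = "very strong"

    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"


# Two coefficients taken from the outputs shown earlier in this section
print(describe_correlation(-0.269552))  # procrastinator vs. NU_GPA
print(describe_correlation(-0.672777))  # cant_change vs. can_change_math_ability
```

These thresholds are conventions rather than rules; what counts as a "strong" correlation varies by field.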

Critical Caveats:

Correlation ≠ Causation
- Strong correlation doesn’t mean one variable causes changes in the other
- Example: Ice cream sales and drowning deaths correlate (both happen in summer), but ice cream doesn’t cause drowning!
Linear Only
- Pearson correlation only captures linear relationships
- Variables might have strong non-linear relationships with correlation near 0
Outliers Matter
- A few extreme values can heavily influence correlation coefficients
- Always visualize your data, don’t just look at numbers
Hidden Variables
- Third variables (confounders) might explain apparent correlations
- Consider lurking variables before drawing conclusions
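Two of these caveats are easy to demonstrate numerically. The sketch below uses synthetic data only: a perfect quadratic relationship whose Pearson correlation is essentially zero, and two unrelated variables whose correlation jumps after a single extreme point is added.

```python
import numpy as np

# Linear only: y depends entirely on x, but the relationship is not linear
x = np.linspace(-3, 3, 101)   # symmetric around zero
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]
print(f"Quadratic relationship: r = {r_nonlinear:.3f}")

# Outliers matter: two independent variables, then one extreme observation
rng = np.random.default_rng(42)
a = rng.normal(size=50)
b = rng.normal(size=50)
r_before = np.corrcoef(a, b)[0, 1]

a_out = np.append(a, 20.0)    # a single point far from the rest of the data
b_out = np.append(b, 20.0)
r_after = np.corrcoef(a_out, b_out)[0, 1]

print(f"Without outlier: r = {r_before:.3f}")
print(f"With one outlier: r = {r_after:.3f}")
```

A scatter plot of either dataset would reveal the issue immediately, which is why plotting the data alongside the coefficient is recommended.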

11.6.11 Summary: Seaborn Best Practices

Now that you’ve learned Seaborn, here are essential best practices:

11.6.11.1 When to Use Seaborn

Best For:

Statistical visualizations (distributions, relationships, comparisons)
Quick, beautiful plots with minimal code
Exploring relationships in DataFrames
Creating publication-ready figures with default settings
Plots with categorical and numerical data together

Not Ideal For:

Highly customized, non-standard plot types (use Matplotlib)
Interactive visualizations (use Plotly or Bokeh)
3D plots (use Matplotlib’s mplot3d or Plotly)
Very simple exploratory plots (Pandas might be faster)

11.6.11.2 Essential Seaborn Workflow

1. Set Aesthetics First:

sns.set_style("whitegrid")      # Professional appearance
sns.set_context("notebook")     # Appropriate sizing
sns.set_palette("colorblind")   # Accessible colors

2. Use the data Parameter:

# Recommended: Clear and readable
sns.scatterplot(data=df, x='col1', y='col2', hue='col3')

# Avoid: Harder to read and modify
sns.scatterplot(x=df['col1'], y=df['col2'], hue=df['col3'])

3. Leverage hue for Multi-Dimensional Plots:

Adds a third dimension via color
Works across most plot types
Automatically generates legends

4. Combine with Matplotlib for Fine-Tuning:

sns.boxplot(data=df, x='category', y='value')
plt.title('Custom Title')           # Matplotlib customization
plt.ylabel('Custom Y Label')
plt.xticks(rotation=45)

5. Choose the Right Plot Type:

Goal	Plot Type	Seaborn Function
Single variable distribution	Histogram	`histplot()`
	Smooth distribution	`kdeplot()`
Compare categories	Bar chart	`barplot()`
	Box plot	`boxplot()`
Two numeric variables	Scatter	`scatterplot()`
Correlations	Heatmap	`heatmap()`
All pairwise relationships	Pair plot	`pairplot()`

11.6.11.3 Color Palette Guide

For Categorical Data (no order):

"deep" - default, good contrast
"colorblind" - accessible (highly recommended!)
"Set2", "Set3" - soft, professional

For Sequential Data (low to high):

"Blues", "Greens", "Reds" - single hue
"viridis", "plasma", "cividis" - perceptually uniform

For Diverging Data (meaningful midpoint):

"coolwarm" - blue to red through a light gray midpoint
"RdBu" - red to blue through a white midpoint
"vlag" - blue to red through a light midpoint
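As a quick illustration of the three palette families, the sketch below loads one palette of each kind with `sns.color_palette()`. The variable names are arbitrary, and `as_cmap=True` assumes a reasonably recent Seaborn version (0.11 or later).

```python
import seaborn as sns

# Categorical: a list of distinct RGB tuples, one per category
categorical = sns.color_palette("colorblind", n_colors=4)

# Sequential and diverging: continuous colormaps for numeric values
sequential = sns.color_palette("viridis", as_cmap=True)
diverging = sns.color_palette("vlag", as_cmap=True)

# A colormap maps a number in [0, 1] to an RGBA color
low, mid, high = diverging(0.0), diverging(0.5), diverging(1.0)
print("vlag low: ", low)
print("vlag mid: ", mid)   # the midpoint is the lightest color
print("vlag high:", high)
```

In a notebook, `sns.palplot(categorical)` displays the colors as swatches, which can help when comparing candidates.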

11.6.12 Practice Exercises

Now apply what you’ve learned!

Exercise 1: Distribution Analysis

# Using the iris dataset:
# 1. Create a histogram of petal_length with KDE
# 2. Add hue by species
# 3. Add a title and customize figure size

Exercise 2: Categorical Comparison

# Using the tips dataset:
# 1. Create a box plot comparing total_bill across different times (Lunch vs Dinner)
# 2. Add hue by sex
# 3. Make it horizontal

Exercise 3: Correlation Analysis

# Using the iris dataset:
# 1. Compute the correlation matrix for all numeric columns
# 2. Create a heatmap with annotations
# 3. Use the 'coolwarm' color palette
# 4. What are the two most correlated features?

Exercise 4: Multi-Dimensional Scatter

# Using the tips dataset:
# 1. Create a scatter plot of total_bill vs tip
# 2. Use hue for time (Lunch/Dinner)
# 3. Use size for the party size
# 4. Add appropriate labels and title

Hint: Refer back to the examples in this section and the Seaborn documentation for parameter details!

11.7 Chapter Summary

Congratulations! You’ve completed a comprehensive journey through data visualization in Python. Let’s recap what you’ve accomplished.

11.7.1 Visualization Fundamentals

Data Understanding:

Distinguishing numeric vs. categorical data
Univariate, bivariate, and multivariate analysis
Matching data types to visualization types

11.7.2 Complete Plot Type Reference

Here’s a master reference of all plot types you’ve learned:

Plot Type	Purpose	Pandas	Matplotlib	Seaborn
Line Plot	Trends over time/continuous data	`df.plot()`	`plt.plot()`	`sns.lineplot()`
Scatter Plot	Relationship between two variables	`df.plot(kind='scatter')`	`plt.scatter()`	`sns.scatterplot()`
Histogram	Distribution of single variable	`df.plot(kind='hist')`	`plt.hist()`	`sns.histplot()`
Bar Chart	Compare categories	`df.plot(kind='bar')`	`plt.bar()`	`sns.barplot()`
Box Plot	Distribution summary + outliers	`df.plot(kind='box')`	`plt.boxplot()`	`sns.boxplot()`
Pie Chart	Part-to-whole relationships	`df.plot(kind='pie')`	`plt.pie()`	N/A
Heatmap	2D data matrix/correlations	N/A	`plt.imshow()`	`sns.heatmap()`
KDE Plot	Smooth distribution	`df.plot(kind='kde')`	N/A	`sns.kdeplot()`

11.7.2.1 The Three-Library Ecosystem

You’ve mastered Python’s three-tier approach to visualization:

┌─────────────────────────────────────┐
│  Pandas .plot()                     │  ← Quick & Easy
├─────────────────────────────────────┤
│  Seaborn                            │  ← Beautiful & Statistical  
├─────────────────────────────────────┤
│  Matplotlib pyplot                  │  ← Complete Control
└─────────────────────────────────────┘

11.7.3 When to Use Each Library

Is this exploratory analysis?
├─ Yes → Use Pandas .plot() for speed
└─ No
   ├─ Need statistical visualization?
   │  ├─ Yes → Use Seaborn
   │  └─ No → Continue below
   └─ Need custom/complex plot?
      ├─ Yes → Use Matplotlib
      └─ No → Use Seaborn (easier syntax)

11.7.4 Essential Best Practices

11.7.4.1 Universal Guidelines

Always Label Your Axes
- Include units: Temperature (°C) not just Temperature
- Make labels descriptive and self-explanatory
Add Informative Titles
- Describe what the plot shows
- Answer: What am I looking at?
Choose Appropriate Colors
- Use colorblind-friendly palettes ("colorblind" in Seaborn)
- Avoid red-green combinations
- Ensure sufficient contrast

Control Figure Size

# Pandas
df.plot(figsize=(12, 6))

# Matplotlib
plt.figure(figsize=(12, 6))

# Seaborn (use Matplotlib)
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='x', y='y')

Save High-Quality Figures

plt.savefig('figure.png', dpi=300, bbox_inches='tight')

11.7.4.2 Library-Specific Tips

Pandas:

Set time column as index for time series plots
Use rot=45 to rotate axis labels
Chain .plot() with other DataFrame operations

Matplotlib:

Use semicolons (;) in Jupyter to suppress unwanted output
Call plt.figure() before creating a new plot
Use fmt shorthand for quick styling: 'o-r' = circles, solid line, red

Seaborn:

Always set style first: sns.set_style("whitegrid")
Use data parameter with column names (not Series)
Leverage hue for multi-dimensional visualization
Combine with Matplotlib for fine-tuning

11.7.5 Pro Tip: Combine Libraries!

The most powerful approach is mixing libraries in a single plot:

# Create beautiful plot with Seaborn
sns.scatterplot(data=df, x='x', y='y', hue='category')

# Fine-tune with Matplotlib
plt.title('Custom Title', fontsize=16, fontweight='bold')
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)
plt.tight_layout()

# Works seamlessly!

This gives you:

Seaborn’s easy syntax and beautiful defaults
Matplotlib’s precise customization
Best of both worlds!

11.8 Independent Study

11.8.1 Practice exercise 1

Read survey_data_clean.csv

	Timestamp	fav_alcohol	parties_per_month	smoke	weed	introvert_extrovert	love_first_sight	learning_style	left_right_brained	personality_type	...	used_python_before	dominant_hand	childhood_in_US	gender	region_of_residence	political_affliation	cant_change_math_ability	can_change_math_ability
0	2022/09/13 1:43:34 pm GMT-5	I don't drink	1.0	No	Occasionally	Introvert	0	Visual (learn best through images or graphic o...	Left-brained (logic, science, critical thinkin...	INFJ	...	1	Right	1	Female	Northeast	Democrat	0	1
1	2022/09/13 5:28:17 pm GMT-5	Hard liquor/Mixed drink	3.0	No	Occasionally	Extrovert	0	Visual (learn best through images or graphic o...	Left-brained (logic, science, critical thinkin...	ESFJ	...	1	Right	1	Male	West	Democrat	0	1
2	2022/09/13 7:56:38 pm GMT-5	Hard liquor/Mixed drink	3.0	No	Yes	Introvert	0	Kinesthetic (learn best through figuring out h...	Left-brained (logic, science, critical thinkin...	ISTJ	...	0	Right	0	Female	International	No affiliation	0	1
3	2022/09/13 10:34:37 pm GMT-5	Hard liquor/Mixed drink	12.0	No	No	Extrovert	0	Visual (learn best through images or graphic o...	Left-brained (logic, science, critical thinkin...	ENFJ	...	0	Right	1	Female	Southeast	Democrat	0	1
4	2022/09/14 4:46:19 pm GMT-5	I don't drink	1.0	No	No	Extrovert	1	Reading/Writing (learn best through words ofte...	Right-brained (creative, art, imaginative, int...	ENTJ	...	0	Right	1	Female	Northeast	Democrat	1	0

5 rows × 51 columns

How does the expected marriage age of the people of STAT303-1 depend on their characteristics? We’ll use visualizations to answer this question.

11.8.1.1

Make a visualization that compares the mean expected_marriage_age of introverts and extroverts (use the variable introvert_extrovert). What insights do you obtain?

11.8.1.2

Does the mean expected_marriage_age of introverts and extroverts depend on whether they believe in love at first sight (variable name: love_first_sight)? Update the previous visualization to answer the question.

11.8.1.3

In addition to love_first_sight, does the mean expected_marriage_age of introverts and extroverts depend on whether they are a procrastinator (variable name: procrastinator)? Update the previous visualization to answer the question.

11.8.1.4

Is there any critical information missing in the above visualizations that, if revealed, may cast doubts on the patterns observed in them?

11.8.2 Practice exercise 2

Read Australia_weather.csv.

	Date	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday	RISK_MM	RainTomorrow
0	10/20/2010	Sydney	12.9	20.3	0.2	3.0	10.9	ENE	37	W	...	57	1028.8	1025.6	3	1	16.9	19.8	No	0.0	No
1	10/21/2010	Sydney	13.3	21.5	0.0	6.6	11.0	ENE	41	W	...	58	1025.9	1022.4	2	5	17.6	21.3	No	0.0	No
2	10/22/2010	Sydney	15.3	23.0	0.0	5.6	11.0	NNE	41	W	...	63	1021.4	1017.8	1	4	19.0	22.2	No	0.0	No
3	10/26/2010	Sydney	12.9	26.7	0.2	3.8	12.1	NE	33	W	...	56	1018.0	1015.0	1	5	17.8	22.5	No	0.0	No
4	10/27/2010	Sydney	14.8	23.8	0.0	6.8	9.6	SSE	54	SSE	...	69	1016.0	1014.7	2	7	20.2	20.6	No	1.8	Yes

5 rows × 24 columns

11.8.2.1

Create a histogram showing the distributions of maximum temperature in Sydney, Canberra and Melbourne.

11.8.2.2

Make a density plot showing the distributions of maximum temperature in Sydney, Canberra and Melbourne.

11.8.2.3

Show the distributions of the maximum and minimum temperatures across all locations in a single plot.

11.8.2.4

Create a scatter plot with a trendline for MinTemp and MaxTemp, including a confidence interval.

Hint: Seaborn's regplot() function overlays a trendline on the scatter plot, together with a 95% confidence interval for the trendline.